Message from the Program Chair


It is with great pleasure that I welcome you to the 17th USENIX Security Symposium, in San Jose, California. A 
total of 174 research papers were submitted to the technical program. Four were withdrawn or otherwise summar- 
and we hope that the detailed technical review comments we provided are helpful. I thank all authors who
submitted papers. High-quality submissions are the starting point for a superb symposium.


Our Program Committee (PC) meeting was graciously hosted April 3-4 by Angelos Keromytis at Columbia
University, with help from Sophie Majewski. The dinner provided afterwards by USENIX, at Pisticci Restaurant, was
most welcome. As expected, the meeting was attended by essentially the entire committee—25 of 26 members. 
Each paper received at least three written reviews, and many were read and discussed by a considerably larger 
group. PC members were restricted to being co-authors on at most two submissions. The physical-presence PC 
meeting and the relatively small Program Committee, as is traditional for USENIX Security (in sharp contrast to 
several other major security research conferences), contributed to a collegial process and open discussions, pooling 
the expertise of the entire committee. I am immensely grateful to the committee for their cooperative spirit and
extraordinary efforts. Every member delivered every review requested, and more. It was a true privilege to work with
such a dedicated and focused team, many of whom seem to serve continuous tours of Program Committee duty. I 
extend my thanks to all of the external reviewers relied upon by the PC members; their names are recorded in the 
frontmatter of these proceedings. 


Beyond the technical program in these proceedings, the symposium is enriched by many other items. These 
include two days of tutorials by area experts (prior to the technical program) and, in parallel with the submitted pa- 
pers, an exceptional invited talks track. For the latter, thanks are due to our invited talks committee of Bill Aiello, 
Angelos Keromytis, and Avi Rubin. This year’s keynote address is by Debra Bowen, California Secretary of State. 
I thank Carrie Gates for organizing the poster session and Hao Chen for chairing the work-in-progress reports. 


Rather than repeat past praises about the USENIX staff, let me give the following advice to all potential future 
program chairs: if given a choice between chairing a USENIX conference and another, choose the former. You 
will truly understand why only once you have done both. I am happy to thank Anne Dickison for driving public- 
ity, Jane-Ellen Long for logistics related to the proceedings, Devon Shaw for support related to the PC meeting and 
the conference itself, Casey Henderson for updates to the USENIX Web site, and Peter Collinson for manning the 
submissions system and review Web site. Did I mention Ellie Young? Who is it that pulls all the strings and holds 
everything together? Thanks, Ellie: it is a pleasure to work with you. Thanks also to Niels Provos, last year’s Pro- 
gram Chair, for guidance, and to Matt Blaze for talking me into acting as this year’s Program Chair. 


I hope you enjoy the symposium as much as I have enjoyed being part of delivering it. 


Paul Van Oorschot, Carleton University 
Program Chair 
Abstract


As the web continues to play an ever increasing role 
in information exchange, so too is it becoming the pre- 
vailing platform for infecting vulnerable hosts. In this 
paper, we provide a detailed study of the pervasiveness 
of so-called drive-by downloads on the Internet. Drive- 
by downloads are caused by URLs that attempt to exploit 
their visitors and cause malware to be installed and run 
automatically. Over a period of 10 months we processed
billions of URLs, and our results show that a non-trivial
number, over 3 million malicious URLs, initiate drive-by
downloads. An even more troubling finding is that
approximately 1.3% of the incoming search queries to 
Google’s search engine returned at least one URL labeled 
as malicious in the results page. We also explore sev- 
eral aspects of the drive-by downloads problem. Specifi- 
cally, we study the relationship between the user brows- 
ing habits and exposure to malware, the techniques used 
to lure the user into the malware distribution networks, 
and the different properties of these networks. 


1 Introduction 


It should come as no surprise that our increasing reliance 
on the Internet for many facets of our daily lives (e.g., 
commerce, communication, entertainment, etc.) makes 
the Internet an attractive target for a host of illicit ac- 
tivities. Indeed, over the past several years, Internet ser- 
vices have witnessed major disruptions from attacks, and 
the network itself is continually plagued with malfeasance [14].
While the monetary gains from the myriad of illicit behaviors
being perpetrated today (e.g., phishing, spam) are only
beginning to be understood [11], it is clear that there is a
general shift in tactics: wide-scale attacks aimed at
overwhelming computing resources are becoming less prevalent,
and traditional scanning attacks are being replaced by other
mechanisms. Chief
among these is the exploitation of the web, and the ser- 
vices built upon it, to distribute malware. 


This change in the playing field is particularly alarm- 
ing, because unlike traditional scanning attacks that use 
push-based infection to increase their population, web- 
based malware infection follows a pull-based model. For 
the most part, the techniques in use today for deliver- 
ing web-malware can be divided into two main cate- 
gories. In the first case, attackers use various social en- 
gineering techniques to entice the visitors of a website 
to download and run malware. The second, more devious
case involves the underhanded tactic of targeting various
browser vulnerabilities to automatically download and run
the binary upon visiting a website, unbeknownst to the
visitor. When popular websites
are exploited, the potential victim base from these so- 
called drive-by downloads can be far greater than other 
forms of exploitation because traditional defenses (e.g., 
firewalls, dynamic addressing, proxies) pose no barrier 
to infection. While social engineering may, in general, 
be an important malware spreading vector, in this work 
we restrict our focus and analysis to malware delivered 
via drive-by downloads. 


Recently, Provos et al. [20] provided insights on this 
new phenomenon, and presented a cursory overview of 
web-based malware. Specifically, they described a num- 
ber of server- and client-side exploitation techniques that 
are used to spread malware, and elucidated the mecha- 
nisms by which a successful exploitation chain can start 
and continue to the automatic installation of malware. In 
this paper, we present a detailed analysis of the malware 
serving infrastructure on the web using a large corpus of 
malicious URLs collected over a period of ten months. 
Using this data, we estimate the global prevalence of 
drive-by downloads, and identify several trends for different
aspects of the web malware problem. Our results
reveal an alarming contribution of China-based web
sites to the web malware problem: overall, 67% of the
malware distribution servers and 64% of the web sites
that link to them are located in China. These results raise
serious questions about the security practices employed
by web site administrators. 


Additionally, we study several properties of the mal- 
ware serving infrastructure, and show that (for the most 
part) the malware serving networks are composed of 
tree-like structures with strong fan-in edges leading to 
the main malware distribution sites. These distribution 
sites normally deliver the malware to the victim after a 
number of indirection steps traversing a path on the dis- 
tribution network tree. More interestingly, we show that 
several malware distribution networks have linkages that 
can be attributed to various relationships. 


In general, the edges of these malware distribution 
networks represent the hop-points used to lure users to 
the malware distribution site. By investigating these 
edges, we reveal a number of causal relationships that 
eventually lead to browser exploitation. More troubling, 
we show that drive-by downloads are being induced by 
mechanisms beyond the conventional techniques of con- 
trolling the content of compromised websites. In par- 
ticular, our results reveal that Ad serving networks are 
increasingly being used as hops in the malware serving 
chain. We attribute this increase to syndication, a com- 
mon practice which allows advertisers to rent out part of 
their advertising space to other parties. These findings 
are problematic as they show that even protected web- 
servers can be used as vehicles for transferring malware. 
Additionally, we also show that contrary to common wis- 
dom, the practice of following “safe browsing” habits 
(i.e., avoiding gray content) by itself is not an effective 
safeguard against exploitation. 


The remainder of this paper is organized as follows. 
In Section 2, we provide background information on how 
vulnerable computer systems can be compromised solely 
by visiting a malicious web page. Section 3 gives an 
overview of our data collection infrastructure and in Sec- 
tion 4 we discuss the prevalence of malicious web sites 
on the Internet. In Section 5, we explore the mecha- 
nisms used to inject malicious content into web pages. 
We analyze several aspects of the web malware distribu- 
tion networks in Section 6. In Section 7 we provide an 
overview of the impact of the installed malware on the 
infected system. Section 8 discusses implications of our 
results and Section 9 presents related work. Finally, we 
conclude in Section 10. 
2 Background


Unfortunately, there are a number of existing exploitation
strategies for installing malware on a user’s computer.
One common technique is to remotely exploit vulnerable
network services. Lately, however, this attack strategy has
become less successful (and presumably, less profitable).
Arguably, the proliferation of technologies such as Network
Address Translators (NATs) and firewalls makes it difficult
to remotely connect to and exploit services running on users’
computers. This, in turn, has led attackers to seek other
avenues of exploitation. An equally potent alternative is
to simply lure web users to connect to (compromised)
malicious servers that subsequently deliver exploits
targeting vulnerabilities of web browsers or their plugins.

Adversaries use a number of techniques to inject con- 
tent under their control into benign websites. In many 
cases, adversaries exploit web servers via vulnerable 
scripting applications. Typically, these vulnerabilities 
(e.g., in phpBB2 or InvisionBoard) allow an adversary 
to gain direct access to the underlying operating sys- 
tem. That access can often be escalated to super-user 
privileges which in turn can be used to compromise any 
web server running on the compromised host. In general, 
upon successful exploitation of a web server the adver- 
sary injects new content to the compromised website. In 
most cases, the injected content is a link that redirects 
the visitors of these websites to a URL that hosts a script 
crafted to exploit the browser. To avoid visual detection 
by website owners, adversaries normally use invisible 
HTML components (e.g., zero pixel IFRAMEs) to hide 
the injected content. 
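
As a rough illustration of what such hidden content looks like to an automated scanner, the following sketch flags IFRAMEs whose declared width or height is zero. It relies only on Python's standard html.parser module; the class name, function name, and sample markup are hypothetical, and content hidden by other means (e.g., via style sheets) is not covered.

```python
from html.parser import HTMLParser


class ZeroPixelIframeFinder(HTMLParser):
    """Collects the src of every IFRAME declared with zero width or height."""

    def __init__(self):
        super().__init__()
        self.hidden_sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "iframe":
            return
        attributes = dict(attrs)
        # Attribute values may be None (bare attribute) or strings like "0" or "0px".
        sizes = [(attributes.get(name) or "").strip().lower()
                 for name in ("width", "height")]
        if any(size in ("0", "0px") for size in sizes):
            self.hidden_sources.append(attributes.get("src"))


def find_hidden_iframes(html):
    finder = ZeroPixelIframeFinder()
    finder.feed(html)
    return finder.hidden_sources


# Hypothetical page: one visible frame and one injected zero-pixel frame.
page = (
    '<html><body><p>Forum post</p>'
    '<iframe src="http://example.com/banner" width="468" height="60"></iframe>'
    '<iframe src="http://example.org/exploit.js" width="0" height="0"></iframe>'
    '</body></html>'
)
print(find_hidden_iframes(page))
```

Only the second frame is reported, since the first declares a visible size.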

Another common content injection technique is to use 
websites that allow users to contribute their own con- 
tent, for example, via postings to forums or blogs. De- 
pending on the site’s configuration, user contributed con- 
tent may be restricted to text but often can also contain 
HTML such as links to images or other external content. 
This is particularly dangerous, as without proper filter- 
ing in place, the adversary can simply inject the exploit 
URL without the need to compromise the web server. 

Figure 1 illustrates the main phases in a typical interaction
that takes place when a user visits a website
with injected malicious content. Upon visiting this
website, the browser downloads the initial exploit script
(e.g., via an IFRAME). The exploit script (in most cases,
JavaScript) targets a vulnerability in the browser or
one of its plugins. Interested readers are referred to
Provos et al. [20] for a number of vulnerabilities that
are commonly used to gain control of the infected system.
Successful exploitation of one of these vulnerabilities
results in the automatic execution of the exploit
code, thereby triggering a drive-by download. Drive-by
downloads start when the exploit instructs the browser to
connect to a malware distribution site to retrieve malware
executable(s). The downloaded executable is then automatically
installed and started on the infected system¹.

Figure 1: A typical interaction of a drive-by download
victim with a landing URL.

Finally, attackers use a number of techniques to evade 
detection and complicate forensic analysis. For example, 
the use of randomly seeded obfuscated JavaScript in
their exploit code is not uncommon. Moreover, to complicate
network-based detection, attackers use a number
of redirection steps before the browser eventually
contacts the malware distribution site.


3 Infrastructure and Methodology 


Our primary objective is to identify malicious web sites 
(i.e., URLs that trigger drive-by downloads) and help 
improve the safety of the Internet. Before proceeding 
further with the details of our data collection methodol- 
ogy, we first define some terms we use throughout this 
paper. We use the terms landing pages and malicious
URLs interchangeably to denote the URLs that initiate 
drive-by downloads when users visit them. In our subse- 
quent analysis, we group these URLs according to their 
top level domain names and we refer to the resulting set 
as the landing sites. In many cases, the malicious payload
is not hosted on the landing site, but instead loaded
via an IFRAME or a SCRIPT from a remote site. We 
call the remote site that hosts malicious payloads a dis- 
tribution site. In what follows, we detail the different 
components of our data collection infrastructure. 
Pre-processing Phase. As Figure 2 illustrates, the data
processing starts from a large web repository maintained 
by Google. Our goal is to inspect URLs from this repos- 
itory and identify the ones that trigger drive-by down- 
loads. However, exhaustive inspection of each URL in 
the repository is prohibitively expensive due to the large 
number of URLs in the repository (on the order of bil- 
lions). Therefore, we first use light-weight techniques to 
extract URLs that are likely malicious, then subject them
to a more detailed analysis and verification phase. 
Figure 2: URL selection and verification workflow.


We employ the MapReduce [9] framework to process
billions of web pages in parallel. For each web page, we 
extract several features, some of which take advantage of 
the fact that many landing URLs are hijacked to include 
malicious payload(s) or to point to malicious payload(s) 
from a distribution site. For example, we use “out of 
place” IFRAMEs, obfuscated JavaScript, or IFRAMEs to 
known distribution sites as features. Using a specialized 
machine-learning framework [7], we translate these fea- 
tures into a likelihood score. We employ five-fold cross- 
validation to measure the quality of the machine-learning 
framework. The cross-validation operates by splitting 
the data set into 5 randomly chosen partitions and then 
training on four partitions while using the remaining par- 
tition for validation. This process is repeated five times. 
For each trained model, we create an ROC curve and use 
the average ROC curve to estimate the overall accuracy. 
Using this ROC curve, we estimate the false positive and 
detection rate for different thresholds. Our infrastructure 
pre-processes roughly one billion pages daily. In order to 
fully utilize the capacity of the subsequent detailed ver- 
ification phase, we choose a threshold score that results 
in a false positive rate of about 10^-3 with a
corresponding detection rate of approximately 0.9. This 
amounts to about one million URLs that we subject to 
the computationally more expensive verification phase. 
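
The choice of operating point can be sketched as follows. This is a minimal illustration, not the framework used in this work: given held-out scores and labels from one validation fold, it picks the lowest threshold whose false positive rate stays within a budget, which maximizes detection under that budget. The function names and the toy scores are hypothetical. The closing arithmetic assumes a false positive rate on the order of 10^-3, which is what one billion pages per day and about one million selected URLs imply.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (false_positive_rate, detection_rate) when score >= threshold is flagged."""
    flagged_benign = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    flagged_malicious = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    benign_total = labels.count(0)
    malicious_total = labels.count(1)
    return flagged_benign / benign_total, flagged_malicious / malicious_total


def pick_threshold(scores, labels, max_false_positive_rate):
    """Lowest observed score whose false positive rate stays within the budget."""
    # The false positive rate can only fall as the threshold rises, so the
    # first candidate that fits the budget also gives the best detection rate.
    for candidate in sorted(set(scores)):
        fpr, _ = rates_at_threshold(scores, labels, candidate)
        if fpr <= max_false_positive_rate:
            return candidate
    return None


# Toy validation fold (made-up scores; 1 = malicious, 0 = benign).
scores = [0.05, 0.10, 0.20, 0.35, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]
labels = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]

threshold = pick_threshold(scores, labels, max_false_positive_rate=0.2)
fpr, detection = rates_at_threshold(scores, labels, threshold)
print(threshold, fpr, detection)

# Daily candidate volume if roughly 1 in 1,000 pre-processed pages is flagged.
pages_per_day = 1_000_000_000
print(round(pages_per_day * 1e-3))
```

In practice the rates would be read off the averaged ROC curve across all five folds rather than a single fold.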
In addition to analyzing web pages in the crawled web
repository, we also regularly select several hundred thousand
URLs for in-depth verification. These URLs are
randomly sampled from popular URLs as well as from 
the global index. We also process URLs reported by 
users. 


Verification Phase. This phase aims to verify whether 
a candidate URL from the pre-processing phase is ma- 
licious (i.e., initiates a drive-by download). To do that, 
we developed a large scale web-honeynet that simultane- 
ously runs a large number of Microsoft Windows images 
in virtual machines. Our system design draws on the ex- 
perience from earlier work [25], and includes unique fea- 
tures that are specific to our goals. In what follows we 
discuss the details of the URL verification process. 

Each honeypot instance runs an unpatched version of
Internet Explorer. To inspect a candidate URL, the system
first loads a clean Windows image, then automatically
starts the browser and instructs it to visit the candidate
URL. We detect malicious URLs using a combination
of execution-based heuristics and results from anti-virus
engines. Specifically, for each visited URL we run
the virtual machine for approximately two minutes and
monitor the system behavior for abnormal state changes
including file system changes, newly created processes
and changes to the system’s registry. Additionally, we
subject the HTTP responses to virus scans using multiple
anti-virus engines. To detect malicious URLs, we developed
scoring heuristics that determine the likelihood
that a URL is malicious. We determine a URL score based
on a combined measure of the different state changes
resulting from visiting the URL. Our heuristics score
URLs based on the number of created processes, the
number of observed registry changes and the number of
file system changes resulting from visiting the URL.

To limit false positives, we choose a conservative decision
criterion that uses an empirically derived threshold
to mark a URL as malicious. This threshold is set
such that it will be met if we detect changes in the sys- 
tem state, including the file system as well as creation 
of new processes. A visited URL is marked as malicious 
if it meets the threshold and one of the incoming HTTP 
responses is marked as malicious by at least one anti- 
virus scanner. Our extensive evaluation shows that this 
criterion introduces negligible false positives. Finally, a
URL that meets the threshold requirement but has no incoming
payload flagged by any of the anti-virus engines
is marked as suspicious.
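
The decision rule can be summarized in a few lines of code. The sketch below is illustrative only: the weights, the threshold value, and the label "unflagged" for URLs below the threshold are assumptions made for the example, since the actual scoring formula is derived empirically and is not given here.

```python
from dataclasses import dataclass


@dataclass
class VisitObservation:
    """State changes recorded while a virtual machine visits one URL."""
    new_processes: int
    registry_changes: int
    file_system_changes: int
    antivirus_hits: int  # number of HTTP responses flagged by at least one scanner


def heuristic_score(obs, weights=(3.0, 1.0, 1.0)):
    # Hypothetical weights; the real combination of state changes is not published.
    w_proc, w_reg, w_fs = weights
    return (w_proc * obs.new_processes
            + w_reg * obs.registry_changes
            + w_fs * obs.file_system_changes)


def classify(obs, threshold=5.0):
    """Return 'malicious', 'suspicious', or 'unflagged' for one visited URL."""
    if heuristic_score(obs) < threshold:
        return "unflagged"  # an anti-virus hit alone is not sufficient
    return "malicious" if obs.antivirus_hits > 0 else "suspicious"


print(classify(VisitObservation(2, 4, 3, 1)))
print(classify(VisitObservation(1, 2, 1, 0)))
print(classify(VisitObservation(0, 1, 0, 2)))
```

Requiring both the state-change threshold and an anti-virus hit is what keeps the malicious label conservative; URLs that meet only the first condition fall into the suspicious set.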

On average, the detailed verification stage processes 
about one million URLs daily, of which roughly 25,000
new URLs are flagged as malicious. The verification system
records all the network interactions as well as the
state changes. In what follows, we describe how we pro- 
cess the network traces associated with the detected ma- 
licious URLs to shed light on the malware distribution 
infrastructure. 


Constructing the Malware Distribution Networks. 
To understand the properties of the web malware serving 
infrastructure on the Internet, we analyze the recorded 
network traces associated with the detected malicious 
URLs to construct the malware distribution networks. 
We define a distribution network as the set of malware 
delivery trees from all the landing sites that lead to a par- 
ticular malware distribution site. A malware delivery tree 
consists of the landing site, as the leaf node, and all nodes 
(i.e., web sites) that the browser visits until it contacts the 
malware distribution site (the root of the tree). To con- 
struct the delivery trees we extract the edges connecting 
these nodes by inspecting the Referer header from the 
recorded successive HTTP requests the browser makes 
after visiting the landing page. However, in many cases 
the Referer headers are not sufficient to extract the 
full chain. For example, when the browser redirection 
results from an external script, the Referer header, in this
case, points to the base page and not the external script
file. Additionally, in many cases the Referer header is 
not set (e.g., because the requests are made from within 
a browser plugin or newly-downloaded malware). 

To connect the missing causality links, we interpret the 
HTML and JavaScript content of the pages fetched by the 
browser and extract all the URLs from the fetched pages. 
Then, to identify causal edges we look for any URLs that 
match any of the HTTP fetches that were subsequently 
visited by the browser. In some cases, URLs contain 
randomly generated strings, so some requests cannot be 
matched exactly. In these cases, we apply heuristics 
based on edit distance to identify the most probable parent
of the URL. Finally, for each malware distribution
site, we construct its associated distribution network by 
combining the different malware delivery trees from all 
landing pages that lead to that site. 
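
The edge-recovery step can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in this work: it trusts the Referer header when it names an already-fetched page, and otherwise falls back to the closest URL (by Levenshtein edit distance) found in the content of previously fetched pages. The function names, the max_distance cut-off, and the example URLs are all hypothetical.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def find_parent(request_url, referer, links_by_page, max_distance=8):
    """Pick the most probable parent page for one HTTP request.

    links_by_page maps each already-fetched page URL to the URLs extracted
    from its HTML/JavaScript. max_distance is an assumed cut-off.
    """
    if referer in links_by_page:
        return referer
    best_parent, best_distance = None, max_distance + 1
    for page, links in links_by_page.items():
        for link in links:
            distance = 0 if link == request_url else edit_distance(link, request_url)
            if distance < best_distance:
                best_parent, best_distance = page, distance
    return best_parent


# Hypothetical trace: pages fetched so far and the URLs found in their content.
links_by_page = {
    "http://landing.example/index.html": ["http://hop.example/redir.js"],
    "http://hop.example/redir.js": ["http://dist.example/load.php?id=a81f3c"],
}
requests = [
    ("http://hop.example/redir.js", "http://landing.example/index.html"),
    ("http://dist.example/load.php?id=77be02", None),  # Referer not set, random id
]
edges = [(find_parent(url, referer, links_by_page), url) for url, referer in requests]
print(edges)
```

The second request carries a randomly generated identifier and no Referer, so it is attached to the page whose extracted link differs from it by the fewest edits. Chaining such (parent, child) edges from the landing page to the distribution site yields one delivery tree.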


Our infrastructure has been live for more than one 
year, continuously monitoring the web and detecting ma- 
licious URLs. In what follows, we report our findings 
based on analyzing data collected during that time pe- 
riod. Again, recall that we focus here on the perva- 
siveness of malicious activity (perpetrated by drive-by 
downloads) that is induced simply by visiting a landing 
page, thereafter requiring no additional interaction on the 
client’s part (e.g., clicking on embedded links). Finally,
we note that due to the large scale of our data collection 
and some infrastructural constraints, a number of longitudinal
aspects of the web malware problem (e.g., the lifetime
of the different malware distribution networks) are
beyond the scope of this paper and are a subject of our 
future investigation. 


4 Prevalence of Drive-by Downloads 


We provide an estimate of the prevalence of web- 
malware based on data collected over a period of ten 
months (Jan 2007 - Oct 2007). During that period, we
subjected over 60 million URLs to in-depth processing
through our verification system. Overall, we detected
more than 3 million malicious URLs hosted on more than
180 thousand landing sites, and we observed more
than 9 thousand different distribution sites. The findings
are summarized in Table 1. These results show
the scope of the problem, but do not necessarily reflect 
the exposure of end-users to drive-by downloads. In what 
follows, we attempt to address this question by estimat- 
ing the overall impact of the malicious web sites. 


Data collection period             Jan - Oct 2007
Total URLs checked in-depth        66,534,330
Unique suspicious landing URLs     3,385,889
Unique malicious landing URLs      3,417,590
Unique malicious landing sites     181,699
Unique distribution sites          9,340

Table 1: Summary of collected data.


To study the potential impact of malicious web sites 
on the end-users, we first examine the fraction of incom- 
ing search queries to Google’s search engine that return 
at least one URL labeled as malicious in the results page. 
Figure 3 provides a running average of this fraction. The 
graph shows an increasing trend in the search queries that 
return at least one malicious result, with an average ap- 
proaching 1.3% of the overall incoming search queries. 
This finding is troubling as it shows that a significant 
fraction of search queries return results that may expose 
the end-user to exploitation attempts. 
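
For readers unfamiliar with the smoothing used in Figure 3, the sketch below computes a trailing running average over a series of daily percentages. The function name and the sample values are hypothetical and are not taken from our data.

```python
from collections import deque


def running_average(values, window=7):
    """Trailing mean over up to `window` most recent values."""
    recent = deque(maxlen=window)
    total = 0.0
    averages = []
    for value in values:
        if len(recent) == window:
            total -= recent[0]  # oldest value, about to be evicted by append
        recent.append(value)
        total += value
        averages.append(total / len(recent))
    return averages


# Made-up daily percentages of queries with at least one malicious result.
daily_percent = [0.4, 0.5, 0.7, 0.6, 0.9, 1.1, 1.0, 1.2, 1.3]
print(running_average(daily_percent, window=7))
```

During the first six days the average is taken over fewer than seven values; from the seventh day on, each point is the mean of the preceding week.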

To further understand the importance of this finding,
we inspect the prevalence of malicious sites among the
links that appear most often in Google search results.
From the top one million URLs appearing in the search
engine results, about 6,000 belong to sites that have been
verified as malicious at some point during our data collection.
Upon closer inspection, we found that these sites
appear at uniformly distributed ranks within the top million
web sites, with the most popular landing page having
a rank of 1,588. These results further highlight the
significance of the web malware threat as they show the
extent of the malware problem; in essence, about 0.6%
of the top million URLs that appeared most frequently
in Google’s search results led to exposure to malicious
activity at some point.

Figure 3: Percentage of search queries that resulted in at
least one URL labeled as malicious; 7-day running avg.

An additional interesting result is the geographic locality
of web-based malware. Table 2 shows the top 5 hosting
countries, by IP address, of the malware distribution
sites and the landing sites. The results
show that a significant number of China-based sites
contribute to the drive-by problem. Overall, 67% of the
malware distribution sites and 64.6% of the landing sites 
are hosted in China. These findings provide more evi- 
dence [13] of poor security practices by web site admin- 
istrators, e.g., running out-dated and unpatched versions 
of the web server software. 


dist. site         % of all      landing site       % of all
hosting country    dist. sites   hosting country    landing sites
China              67%           China              64.6%
United States                    United States
Russia                           Russia
Malaysia                         Korea
Korea                            Germany

Table 2: Top 5 Hosting countries


Upon closer inspection of the geographic locality of 
the web-malware distribution networks as a whole (i.e., 
the correlation between the location of a distribution site 
and the landing sites pointing to it), we see that the mal- 
ware distribution networks are highly localized within 
common geographical boundaries. This locality varies 
across different countries, and is most evident in China,
with 96% of the landing sites in China pointing to mal- 
ware distribution servers hosted in that country. 


4.1 Impact of browsing habits 


In order to examine the impact of users’ browsing habits 
on their exposure to exploitation via drive-by downloads, 
we measure the prevalence of malicious websites across 
the different website functional categories based on the 
DMOZ classification [1]. Using a large random sample 
of about 7.2 million URLs, we first map each URL to
its corresponding DMOZ category. We were able to find 
the corresponding DMOZ categories for about 50% of 
these URLs². We further inspect each URL through our
in-depth verification system, then measure the percentage
of malicious URLs in each functional category. Figure 4 
shows the prevalence of detected malicious and suspi- 
cious websites in each top level DMOZ category. 
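
The per-category measurement reduces to a simple tally. The sketch below assumes each inspected URL has already been mapped to a top-level category and assigned a verdict by the verification system; the function name, the verdict label "clean", and the sample records are hypothetical, and the resulting percentages are not those of Figure 4.

```python
from collections import Counter


def prevalence_by_category(records):
    """records: iterable of (category, verdict) pairs, where verdict is
    'malicious', 'suspicious', or 'clean'.

    Returns {category: (percent_malicious, percent_suspicious)}.
    """
    totals = Counter()
    malicious = Counter()
    suspicious = Counter()
    for category, verdict in records:
        totals[category] += 1
        if verdict == "malicious":
            malicious[category] += 1
        elif verdict == "suspicious":
            suspicious[category] += 1
    return {
        category: (100.0 * malicious[category] / count,
                   100.0 * suspicious[category] / count)
        for category, count in totals.items()
    }


# Made-up sample: four Adult URLs and five News URLs.
sample = (
    [("Adult", "malicious")] + [("Adult", "clean")] * 3
    + [("News", "suspicious")] + [("News", "clean")] * 4
)
print(prevalence_by_category(sample))
```

URLs without a known category would simply be left out of the input, mirroring the roughly 50% coverage noted above.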

As the graph illustrates, website categories associ- 
ated with “gray content” (e.g., adult websites) show a 
stronger connection to malicious content. For instance, 
about 0.6% of the URLs in the Adult category exhibited 
drive-by download activity upon visiting these websites. 
These results suggest that users who browse such web- 
sites will likely be more exposed to exploitation com- 
pared to users who browse websites from the other func- 
tional categories. However, an important observation 
from the same figure is that the distribution of malicious 
websites is not significantly skewed toward pages that 
serve gray content. In fact, the distribution shows that 
malicious websites are generally present in all website 
categories we observed. Overall, these results show that 
while “safe browsing” habits may limit users’ exposure 
to drive-by downloads, they do not provide an effective
safeguard against exploitation. 
Figure 4: Prevalence of suspicious and malicious pages.

5 Malicious Content Injection


In Section 4, we showed that exposure to web-malware 
is not strongly tied to a particular browsing habit. Our as- 
sertion is that this is due, in part, to the fact that drive-by 
downloads are triggered by visiting staging sites that are 
not necessarily of malicious intent but have content that 
lures the visitor into the malware distribution network. 


In this section, we validate this conjecture by study- 
ing the properties of the web sites that participate in the 
malware delivery trees. As discussed in Section 2, at- 
tackers use a number of techniques to control the con- 
tent of benign web sites and turn them into nodes in the 
malware distribution networks. These techniques can be 
divided into two categories: web server compromise and 
third party contributed content (e.g., blog posts). Unfor- 
tunately, it is generally difficult to determine the exact 
contribution of either category. In fact, in some cases 
even manual inspection of the content of each web site 
may not lead to conclusive evidence regarding the man- 
ner in which the malicious content was injected into the 
web site. Therefore, in this section we provide insights 
into some features of these web sites that may explain 
their presence in the malware delivery trees. We only fo- 
cus on the features that we can determine in an automated 
fashion. Specifically, where possible, we first inspect 
the version of the software running on the web server 
for each landing site. Additionally, we explore one im- 
portant angle that we discovered which contributes sig- 
nificantly to the distribution of web malware—namely, 
drive-by downloads via Ads. 


5.1 Web Server Software 


We begin by examining (where possible) the software
running on the web-servers for all the landing sites
that lead to the malware distribution sites. Specifically, 
we collected all the “Server” and “X-Powered-By” 
header tokens from each landing page (see Table 3). 
Not surprisingly, of those servers that reported this in- 
formation, a significant fraction were running outdated 
versions of software with well known vulnerabilities³.
For example, 38.1% of the Apache servers and 39.9% 
of servers with PHP scripting support reported a version 
with security vulnerabilities. Overall, these results reflect 
the weak security practices applied by the web site ad- 
ministrators. Clearly, running unpatched software with 
known vulnerabilities increases the risk of content con- 
trol via server exploitation. 
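
A minimal sketch of how such header tokens can be turned into comparable versions is shown below. It assumes the conventional product/version token format (e.g., "Apache/1.3.37"); the function names, the sample header values, and the minimum-safe versions used in the example are hypothetical and do not reflect the cut-offs used in our analysis.

```python
import re

# Matches product tokens such as "Apache/1.3.37" or "PHP/4.4.4" (product "/" version).
PRODUCT_TOKEN = re.compile(r"([A-Za-z][\w.\-]*)/(\d[\w.\-]*)")


def parse_product_tokens(header_value):
    """Return {product: version} for every product/version token in a header value."""
    return {name: version for name, version in PRODUCT_TOKEN.findall(header_value)}


def version_tuple(version):
    """Turn '1.3.37' into (1, 3, 37); letters such as the 'd' in '0.9.8d' are ignored."""
    return tuple(int(part) for part in re.findall(r"\d+", version))


def is_outdated(version, minimum_safe):
    # minimum_safe is an assumed cut-off supplied by the analyst.
    return version_tuple(version) < version_tuple(minimum_safe)


# Hypothetical response headers from one landing page.
server = "Apache/1.3.37 (Unix) PHP/4.4.4 mod_ssl/2.8.28 OpenSSL/0.9.8d"
powered_by = "PHP/4.4.4"

tokens = parse_product_tokens(server)
tokens.update(parse_product_tokens(powered_by))
print(tokens)
print(is_outdated(tokens["Apache"], "2.2.6"))
```

Comparing versions as integer tuples avoids the pitfall of string comparison, under which "1.3.9" would sort after "1.3.37". Servers that report only a product name, without a version, cannot be assessed this way.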
Table 3: Server version for landing sites. In the case of Microsoft IIS, we could not verify their version.
Figure 5: Percentage of landing sites potentially infecting
visitors via malicious advertisements, and their relative
share in the search results.


5.2 Drive-by Downloads via Ads


Today, the majority of Web advertisements are distributed
in the form of third party content to the advertising
web site. This practice is somewhat worrisome, as
a web page is only as secure as its weakest component.
In particular, even if the web page itself does not contain
any exploits, insecure Ad content poses a risk to
advertising web sites. With the increasing use of Ad
syndication (which allows an advertiser to sell advertising
space to other advertising companies that in turn can yet
again syndicate their content to other parties), the chances
that insecure content gets inserted somewhere along the
chain quickly escalate. Far too often, this can lead to web
pages running advertisements that lead to untrusted
content. This, in itself, represents an attractive avenue for
distributing malware, as it provides the adversary with a
way to inject content into web sites with a large visitor
base without having to compromise any web server.

To assess the extent of this behavior, we estimate the
overall contribution of Ads to drive-by downloads. To
do so, we construct the malware delivery trees from all
detected malicious URLs following the methodology
described in Section 3. For each tree, we examine every
intermediary node for membership in a set of 2,000 well
known advertising networks. If any of the nodes qualify,
we count the landing site as being infectious via Ads.
Moreover, to highlight the impact of the malware deliv- 
ered via Ads relative to the other mechanisms, we weight 
the landing sites associated with Ads based on the fre- 
quency of their appearance in Google search results com- 
pared to that of all landing sites. Figure 5 shows the 
percentage of landing sites belonging to Ad networks. 
On average, 2% of the landing sites were delivering mal- 
ware via advertisements. More importantly, the overall 
weighted share for those sites was substantial—on aver- 
age, 12% of the overall search results that returned land- 
ing pages were associated with malicious content due to 
unsafe Ads. This result can be explained by the fact that 
Ads normally target popular web sites, and so have a 
much wider reach. Consequently, even a small fraction 
of malicious Ads can have a major impact (compared to 
the other delivery mechanisms). 
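
The two percentages discussed above (the plain share of landing sites and the share weighted by search-result appearances) can be illustrated with a short sketch. The data layout and the names `trees`, `ad_networks`, and `search_hits` are assumptions made for this example; the paper does not describe its data structures.

```python
def ad_shares(trees, ad_networks, search_hits):
    """Return (site share, search-weighted share) of Ad-infected landing sites.

    trees:       {landing site: list of intermediary hosts in its delivery tree}
    ad_networks: set of known advertising network hosts
    search_hits: {landing site: number of appearances in search results}
    """
    if not trees:
        return 0.0, 0.0
    # A landing site counts as infectious via Ads if any intermediary
    # node belongs to a known advertising network.
    via_ads = {
        site
        for site, nodes in trees.items()
        if any(node in ad_networks for node in nodes)
    }
    site_share = len(via_ads) / len(trees)
    total_hits = sum(search_hits.get(site, 0) for site in trees)
    ad_hits = sum(search_hits.get(site, 0) for site in via_ads)
    weighted_share = ad_hits / total_hits if total_hits else 0.0
    return site_share, weighted_share
```

With one popular Ad-infected site among four landing sites, the site share can be small while the weighted share is large, which mirrors the gap between the 2% and 12% figures reported above.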


Another interesting aspect of the results shown in
Figure 5 is that Ad-delivered drive-by downloads seem to
appear in sudden short-lived spikes. This is likely due
to the fact that Ads appearing on several advertising web
sites are centrally controlled, and therefore allow the
malicious content to appear on thousands of web sites
almost instantaneously. Similarly, once detected, these
Ads are removed simultaneously, and so disappear as
quickly as they appeared. For this reason, we notice
that drive-by downloads delivered by other content
injection techniques (e.g., compromise of individual web
servers) have a more lasting effect than Ad-delivered
malware, as each web site must be secured independently.


The general practice of Ad syndication contributes
significantly to the rise of Ad-delivered malware. Our
results show that overall 75% of the landing sites that
delivered malware via Ads use multiple levels of Ad
syndication. To understand how far trust would have to
extend in order to limit the Ad-delivered drive-by
downloads, we plot the distribution of the path length from
the landing site leading to the malware distribution sites
for each delivery tree. The edges connecting the nodes in
these paths reflect the number of redirects a browser has
to follow before receiving the final payload. Hence, for
syndicated Ads that delivered malware, the path length
is indicative of the number of syndication steps before
reaching the final Ad; in our case, the malware payload.
Figure 6 shows the distribution of the number of
redirects for syndicated Ads that delivered malware relative
to the other malicious landing URLs. The results are
quite telling: malware delivered via Ads exhibits longer
delivery chains. In 50% of all cases, more than 6
redirection steps were required before receiving the
malware payload. Clearly, it is increasingly difficult to
maintain trust along such long delivery chains.


Figure 6: CDF of the number of redirection steps for Ads
that successfully delivered malware.

Inspecting the delivery trees that featured syndication
reveals a total of 55 unique Ad networks participating
in these trees. We further studied the relative role of the
different networks by evaluating the frequency of
appearance of each Ad network in the malware delivery trees.
Interestingly, our results show that five advertising
networks appear in approximately 75% of all malware
delivery trees. Figure 7 shows the distribution of the relative
position of each network in the malware delivery chains
it participated in. The normalized position is calculated
by dividing the index of the Ad network in each chain
by the length of the chain. The graph shows that these
advertising networks split into three different categories.
In the first category, which includes network I, the
advertising network appears at the beginning of the
delivery chain. In the second category, which includes
networks II-IV, advertising networks appear frequently
in the middle of the delivery chains. In both these
categories advertising networks do not participate directly
in delivering malware. However, the relative position of
networks in the delivery chain may be used as an
indication of their relationship with the malware distribution
sites: the deeper a network’s relative position, the closer
it is related to the malware distribution site. Finally, in
the third category, indicated by network V, our analysis
revealed that in almost 50% of all incidents, the
advertising network is directly delivering malware. For example,
advertising network V pushes Ads that install malware in
the form of a browser toolbar.


Figure 7: CDF of the normalized position of the top five
Ad networks most frequently participating in malware
delivery chains.
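
As a small illustration of the normalized position used above, the following sketch divides the index of a network in each chain by the chain length. The 1-based indexing is an assumption (the text does not say whether indices start at 0 or 1), and the function name and list-of-lists layout are hypothetical.

```python
def normalized_positions(chains, network):
    """Return the normalized position of `network` in every chain it appears in.

    chains:  list of delivery chains, each an ordered list of hosts from
             the landing site to the malware distribution site
    network: host name of the advertising network of interest
    """
    positions = []
    for chain in chains:
        # Assumption: indices are 1-based, so the last node maps to 1.0.
        for index, node in enumerate(chain, start=1):
            if node == network:
                positions.append(index / len(chain))
    return positions
```

Values near 0 correspond to networks at the start of a chain (the first category), and values near 1 to networks adjacent to, or acting as, the final malware-serving hop (the third category).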


Finally, we illustrate this problem with an example
from our data corpus. The landing page in our example
belongs to a Dutch radio station’s web site. The radio
station in question was showing a banner advertisement
from a German advertising site. Using JavaScript, that
advertiser redirected to a prominent advertiser in the US,
which in turn redirected to yet another advertiser in the
Netherlands. That advertiser redirected to another
advertisement (also in the Netherlands) that contained
obfuscated JavaScript, which, when deobfuscated, pointed
to yet another JavaScript hosted in Austria. The final
JavaScript was encrypted and redirected the browser via
multiple IFRAMEs to adxtnet.net, an exploit site hosted
in Austria. This resulted in the automatic installation of
multiple Trojan Downloaders. While it is unlikely that the
initial advertising companies were aware of the malware
installations, each redirection gave another party control
over the content on the original web page, with
predictable consequences.


6 Malware Distribution Infrastructure


In this section, we explore various properties of the
hosting infrastructure for web malware. In particular, we
explore the size of the malware distribution networks,
and examine the distribution of binaries hosted across
sites. We argue that such analysis is important, as it sheds 
light on the sophistication of the hosting infrastructures 
and the level of malfeasance we see today. As is the case 
with other recent malware studies (e.g., [5, 26, 21]) we 
hope that this analysis will be of benefit to researchers 
and practitioners alike.


Figure 8: CDF of the number of landing sites pointing to
a particular malware distribution site.


For the remaining discussion, recall that a malware 
distribution network constitutes all the landing sites that 
point to a single distribution site. Using the methodol- 
ogy described in Section 3, we identified the distribution 
networks associated with each malware distribution site. 
We first evaluate their size in terms of the total number of 
landing sites that point to them. Figure 8 shows the dis- 
tribution of sizes for the different distribution networks. 

The graph reveals two main types of malware distri- 
bution networks: (1) networks that use only one landing 
site, and (2) networks that have multiple landing sites. 
As the graph shows, distribution networks can grow to 
have well over 21,000 landing sites pointing to them. 
That said, roughly 45% of the detected malware distri- 
bution sites used only a single landing site at a time. We 
manually inspected some of these distribution sites and 
found that the vast majority were either subdomains on 
free hosting services, or short-lived domains that were 
created in large numbers. It is likely, though not con- 
firmed, that each of these sites used only a single landing 
site as a way to slip under the radar and avoid detection. 

Next, we examine the network location of the malware
distribution servers and the landing sites linking to them.
Figure 9 shows that the malware distribution sites are
concentrated in a limited number of /8 prefixes. About
70% of the malware distribution sites have IP addresses
within the 58.* to 61.* and 209.* to 221.* network
ranges. Interestingly, Anderson et al. [5] observed
comparable IP space concentrations for the scam hosting
infrastructure. The landing sites, however, exhibit
relatively more IP space diversity; roughly 50% of the
landing sites fell in the above ranges.


Figure 9: The cumulative fraction of malware distribution
sites over the /8 IP prefix space.


We further investigated the Autonomous System (AS)
locality of the malware distribution sites by mapping
their IP addresses to the AS responsible for the longest
matching prefixes for these IP addresses. We use the
latest BGP snapshot from Routeviews [23] to do the IP to
AS mapping. Our results show that all the malware
distribution sites’ IP addresses fall into a relatively small
set of ASes: only 500 as of this writing. Figure 10 shows
the cumulative fraction of these sites across the ASes
hosting them (sorted in descending order by the number
of sites in each AS). The graph further shows the highly
nonuniform concentration of the malware distribution
sites: 95% of these sites map to only 210 ASes. Finally,
the results of mapping the landing sites (not shown)
produced 2,517 ASes with 95% of the sites falling in these
500 ASes.


Figure 10: The cumulative fraction of the malware
distribution sites across the different ASes.
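
The two mappings used above (IP address to /8 prefix, and IP address to AS via longest matching prefix) can be sketched as follows. This is only an illustration: the route table is a hand-written stand-in for a real BGP snapshot, the prefixes and AS numbers in any usage are documentation values, all function names are hypothetical, and the linear scan would normally be replaced by a prefix trie for large tables.

```python
import ipaddress


def build_table(routes):
    """routes: iterable of (prefix string, AS number) pairs."""
    return [(ipaddress.ip_network(prefix), asn) for prefix, asn in routes]


def lookup_as(table, ip):
    """Return the AS announcing the longest prefix containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None
    for network, asn in table:
        if addr in network and (best is None or network.prefixlen > best[0].prefixlen):
            best = (network, asn)
    return best[1] if best else None


def slash8(ip):
    """Return the /8 prefix (first octet) of a dotted-quad IPv4 address."""
    return int(ip.split(".")[0])


def share_in_ranges(ips, ranges):
    """Fraction of addresses whose /8 prefix falls in any (low, high) range."""
    hits = sum(any(lo <= slash8(ip) <= hi for lo, hi in ranges) for ip in ips)
    return hits / len(ips)
```

For example, `share_in_ranges(ips, [(58, 61), (209, 221)])` computes the kind of concentration figure quoted above for a list of distribution-site addresses.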

Lastly, the distribution of malware across domains 
also gives rise to some interesting insights. Figure 11 
shows the distribution of the number of unique mal- 
ware binaries (as inferred from MD5 hashes) down- 
loaded from each malware distribution site. As the graph 
shows, approximately 42% of the distribution sites deliv- 
ered a single malware binary. The remaining distribution 
sites hosted multiple distinct binaries over their observa- 
tion period in our data, with 3% of the servers hosting 
more than 100 binaries. In many cases, we observed that
the multiple payloads reflect deliberate obfuscation
attempts to evade detection. In what follows, we take a
more in-depth look by studying the different forms of
relationships among the various distribution networks.


Figure 11: CDF of the number of unique binaries
downloaded from each malware distribution site.
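
Counting unique binaries per distribution site by MD5 hash, as described above, amounts to grouping payload digests by site. A minimal sketch, assuming the downloads are available as (site, payload bytes) pairs; the function name and input layout are hypothetical.

```python
import hashlib
from collections import defaultdict


def unique_binaries_per_site(downloads):
    """downloads: iterable of (distribution site, payload bytes) pairs.

    Returns {site: number of distinct MD5 digests seen for that site}.
    """
    digests = defaultdict(set)
    for site, payload in downloads:
        digests[site].add(hashlib.md5(payload).hexdigest())
    return {site: len(seen) for site, seen in digests.items()}
```

Because any byte-level change yields a new digest, repacked or re-obfuscated copies of the same malware count as distinct binaries under this measure, which is consistent with the obfuscation remark above.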


6.1 Relationships Among Networks 


To gain a better perspective on the degree of connectiv- 
ity between the distribution networks, we investigate the 
common properties of the hosting infrastructure across 
the malware distribution sites. We also evaluate the de- 
gree of overlap among the landing sites linking to the 
different malware distribution sites.

Malware hosting infrastructure. Throughout our
measurement period we detected 9,430 malware
distribution sites. In 90% of the cases each site is hosted
on a single IP address. The remaining 10% of the sites are
hosted on IP addresses that host multiple malware
distribution sites. Our results show IP addresses that hosted
up to 210 malware distribution sites. Closer inspection
revealed that these addresses refer to public hosting servers
that allow users to create their own accounts. These
accounts appear as sub-folders of the virtual hosting
server DNS name (e.g., 5123.com/akgy,
5123.com/alavin, 5123.com/anti) or in many cases as
separate DNS aliases that resolve to the IP address of the
hosting server. We also observed several cases where the
hosting server is a public blog that allows users to have
their own pages (e.g., mihanblog.com/abadan2,
mihanblog.com/askbox).


Figure 12: CDF of the normalized pairwise intersection
between landing sites across distribution networks.


Overlapping landing sites. We further evaluate the 
overlap between the landing sites that point to the dif- 
ferent malware distribution sites. To do so, we calculate 
the pairwise intersection between the sets of the landing 
sites pointing to each of the distribution sites in our data 
set. For a distribution network $i$ with a set of landing
sites $X_i$ and network $j$ with the set of landing sites $X_j$,
the normalized pairwise intersection of the two networks,
$C_{i,j}$, is calculated as

$$C_{i,j} = \frac{|X_i \cap X_j|}{|X_i|} \qquad (1)$$

where $|X|$ is the number of elements in the set $X$.
Interestingly, our results showed that 80% of the
distribution networks share at least one landing page. Figure 12
shows the normalized pair-wise landing sets intersection
across these distribution networks. The graph reveals a 
strong overlap among the landing sites for the related net- 
work pairs. These results suggest that many landing sites 
are shared among multiple distribution networks. For ex- 
ample, in several cases we observed landing pages with 
multiple IFRAMEs linking to different malware distribu- 
tion sites. Finally, we note that the sudden jump to a 
pair-wise score of one is mostly due to network pairs in 
which the landing sites for one network are a subset of 
those for the other network.


Figure 13: CDF of the normalized pairwise intersection
between malware hashes across distribution networks.
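
The normalized pairwise intersection is straightforward to compute with sets. The sketch below assumes the metric divides the size of the intersection by the size of the first network's set, which makes it asymmetric; the function name is hypothetical. The same function applies to sets of landing sites and to sets of malware hashes.

```python
def normalized_intersection(x_i, x_j):
    """Return |x_i & x_j| / |x_i| for two sets (asymmetric in its arguments)."""
    if not x_i:
        raise ValueError("the first set must be non-empty")
    return len(x_i & x_j) / len(x_i)
```

Under this assumption, a score of one means the first network's set is entirely contained in the second's, which matches the subset explanation given for the jump at one in Figure 12.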


Content replication across malware distribution sites. 
We finally evaluate the extent to which malware is repli- 
cated across the different distribution sites. To do so, 
we use the same metric in Equation 1 to calculate the
normalized pairwise intersection of the set of malware 
hashes served by each pair of distribution sites. Our re- 
sults show that in 25% of the malware distribution sites, 
at least one binary is shared between a pair of sites. 
While malware hashes exhibit frequent changes as a re- 
sult of obfuscation, our results suggest that there is still a 
level of content replication across the different sites. Fig- 
ure 13 shows the normalized pair-wise intersection of the 
malware sets across these distribution networks. As the 
graph shows, binaries are less frequently shared between 
distribution sites compared to landing sites, but taken as 
a whole, there is still a non-trivial degree of similarity 
among these networks. 


7 Post Infection Impact 


Recall that upon visiting a malicious URL, the browser
downloads the initial exploit. The exploit (in most cases,
JavaScript) targets a vulnerability in the browser or
one of its plugins and takes control of the infected
system, after which it retrieves and runs the malware
executable(s) downloaded from the malware distribution
site. Rather than inspecting the behavior of each phase
in isolation, our goal is to give an overview of the
collective changes that happen to the system state after
visiting a malicious URL. Figure 14 shows the distribution
of the number of Windows executables downloaded
after visiting a malicious URL as observed from
monitoring the interaction between the browser and the
malware distribution site. As the graph shows, visiting
malicious URLs can lead to a large number of downloads
(8 on average, but as large as 60 in the extreme case).


Figure 14: CDF of the number of downloaded executables
as a result of visiting a malicious URL


Another noticeable outcome is the increase in the
number of running processes on the virtual machine.
This increase is associated with the automatic execution
of binaries. For each landing URL, we collected the
number of processes that were started on the guest
operating system after being infected with malware.
Figure 15 shows the CDF of the number of processes
launched after the system is infected. As the graph shows,
visiting malicious URLs produces a noticeable increase
in the number of processes, in some cases inducing so
much overhead that they “crashed” the virtual machine.

Additionally, we examine the type of registry changes
that occur when the malware executes. Overall, we
detected registry changes after visiting 57.5% of the
landing pages. We divide these changes into the
following categories: BHO indicates that the malware
installed a Browser Helper Object that can access
privileged state in the browser; Preferences means that the
browser home page, default search engine or name server
were changed by the malware; Security indicates that
malware changed firewall settings or even disabled
automatic software updates; Startup indicates that the
malware is trying to persist across reboots. Notice that these
categories are not mutually exclusive (i.e., a single
malicious URL may cause changes in multiple categories).
Table 4 summarizes the percentage of registry changes
per category. Notice that “Startup” changes are the most
prevalent, indicating that malware tries to persist even
after the machine is rebooted.


Figure 15: CDF of the number of processes started after
visiting a malicious URL


Category | BHO   | Preferences | Security | Startup
URLs %   | 6.99% | 23.5%       | 36.18%   | 51.27%


Table 4: Registry changes from drive-by downloads.


In addition to the registry changes, we analyzed the
network activity of the virtual machine post infection. In
our system, the virtual machines are allowed to perform
only DNS and HTTP connections. Table 5 shows the
percentage of connection attempts per destination port.
Even though we omit the HTTP connections originating
from the browser, HTTP is still the most prevalent
port for malicious activity post-infection. This is due
to “downloader” binaries that fetch, in some cases, up
to 60 binaries over HTTP. We also observe a significant
percentage of connection attempts to typical IRC ports,
accounting for more than 50% of all non-HTTP
connections. As a number of earlier studies have already
shown (e.g., [6, 19, 8, 21, 22, 12]), the IRC connection
attempts are most likely for adding the compromised
machine to an IRC botnet without its owner’s consent,
confirming the earlier conjecture by Provos et al. [20]
regarding the connection between web malware and
botnets. More detailed examples of malware’s behavior can
be found in Polychronakis et al. [18].


Table 5: Most frequently contacted ports directly by the
downloaded malware.


7.1 Anti-virus engine detection rates 


As we discussed earlier, web based malware uses a pull- 
based delivery mechanism in which a victim is required 
to visit the malware hosting server or any URL linking to 
it in order to download the malware. This behavior puts 
forward a number of challenges to defense mechanisms 
(e.g., malware signature generation schemes) mainly due 
to the inadequate coverage of the malware collection sys- 
tem. For example, unlike active scanning malware which 
uses a push-based delivery mechanism (and so sufficient 
placement of honeypot sensors can provide good cover- 
age), the web is significantly more sparse and, therefore, 
more difficult to cover. 

In what follows, we evaluate the potential implications
of the web malware delivery mechanism by measuring
the detection rates of several well known anti-virus
engines. Specifically, we evaluate the detection rate of each
anti-virus engine against the set of suspected malware
samples collected by our infrastructure. Since we cannot
rely on the anti-virus engines themselves to label these
samples, we developed a heuristic to detect suspected
binaries before subjecting them to the anti-virus scanners.
For each URL inspected by our in-depth verification
system, we test whether visiting the URL caused the
creation of at least one new process on the virtual
machine. For the URLs that satisfy this condition, we simply
extract any binary download(s) (see Note 4) from the
recorded HTTP response and “flag” them as suspicious.
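
The flagging heuristic above can be sketched as follows. The PE check follows the standard layout of Windows executables (an "MZ" stub whose field at offset 0x3C points to the "PE\0\0" signature); requiring the payload to begin with "MZ" is an assumption about how the PE header search was done, and the `visits` record layout and function names are hypothetical.

```python
import struct


def looks_like_pe(payload):
    """Return True if payload starts with an MZ stub pointing at a PE signature."""
    if len(payload) < 0x40 or payload[:2] != b"MZ":
        return False
    # The 4-byte little-endian field at 0x3C holds the PE header offset.
    (pe_offset,) = struct.unpack_from("<I", payload, 0x3C)
    return payload[pe_offset:pe_offset + 4] == b"PE\x00\x00"


def flag_suspicious(visits):
    """visits: iterable of dicts with 'new_processes' (int) and
    'responses' (list of HTTP response bodies as bytes).

    Returns the bodies that look like Windows executables, taken only
    from visits that started at least one new process.
    """
    flagged = []
    for visit in visits:
        if visit["new_processes"] >= 1:
            flagged.extend(b for b in visit["responses"] if looks_like_pe(b))
    return flagged
```

As the false-positive discussion later in this section notes, legitimate installers can also satisfy both conditions, so the output of such a filter is a set of suspects rather than confirmed malware.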

We applied the above methodology to identify
suspicious binaries on a daily basis over the one-month
period of April 2007. We submit each binary to each of the
anti-virus scanners using the latest virus definitions on
that day. Then, for an anti-virus engine, the detection
rate is simply the number of detected (flagged) samples
divided by the total number of suspicious malware
instances inspected on that day. Figure 16 illustrates the
individual detection rates of each of the anti-virus
engines. The graph reveals that the detection capability of
the anti-virus engines is lacking, with an average
detection rate of 70% for the best engine. These results are
disturbing as they show that even the best anti-virus
engines in the market (armed with their latest definitions)
fail to cover a significant fraction of web malware.


Figure 16: Detection rates of 3 anti-virus engines.


False Positives. Notice that the above strategy may 
falsely classify benign binaries as malicious. To eval- 
uate the false positives, we use the following heuristic: 
we optimistically assume that all suspicious binaries will
eventually be discovered by the anti-virus vendors. Using
the set of suspicious binaries collected over the one-month
historic period, we re-scan all undetected binaries two
months later (in July 2007) using the latest virus
definitions. Then, all undetected binaries from the rescanning
step are considered false positives. Overall, our results 
show that the earlier analysis is fairly accurate with false 
positive rates of less than 10%. We further investigated a 
number of binaries identified as false positives and found 
that a number of popular installers exhibit a behavior 
similar to that of drive-by downloads, where the installer 
process first runs and then downloads the associated soft- 
ware package. To minimize the impact of false positives, 
we created a white-list of all known benign downloads, 
and all binaries in the white-list are exempted from the 
analysis in this paper. 

Of course, we are being overly conservative here as 
our heuristic does not account for binaries that are never 
detected by any anti-virus engine. However, for our 
goals, this method produces an upper bound for the re- 
sulting false positives. As an additional benchmark we 
asked for direct feedback from anti-virus vendors about 
the accuracy of the undetected binaries that we (now) 
share with them. On average, they reported about 6%
false positives in the shared binaries, which is within the
bounds of our prediction.
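
The detection-rate and false-positive calculations in this subsection reduce to simple ratios. The sketch below is illustrative only: the input layouts and function names are hypothetical, and dividing by the full set of suspicious binaries for the false-positive bound is an assumption, since the text does not state the denominator.

```python
def daily_detection_rate(results):
    """results: list of booleans, one per suspicious sample scanned that day
    (True if the engine flagged the sample)."""
    return sum(results) / len(results)


def average_detection_rate(days):
    """Mean of the daily detection rates, skipping days with no samples."""
    rates = [daily_detection_rate(results) for results in days if results]
    return sum(rates) / len(rates)


def false_positive_upper_bound(suspicious, detected_after_rescan):
    """Share of suspicious samples still undetected after the later rescan.

    Assumption: every sample that stays undetected is counted as a false
    positive, so the result is an upper bound rather than an estimate.
    """
    suspicious = set(suspicious)
    undetected = suspicious - set(detected_after_rescan)
    return len(undetected) / len(suspicious)
```

Counting never-detected samples as false positives overstates the error whenever some of them are genuine malware that no engine recognizes, which is why the result is described above as an upper bound.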


8 Discussion 


Undoubtedly, the level of malfeasance on the Internet is a 
cause for concern. That said, while our work to date has 
shown that the prevalence of web-malware is indeed a 
serious threat, the analysis herein says nothing about the 
number of visitors that become infected as a result of vis- 
iting a malicious page. In particular, we note that since 
our goal is to survey the landscape, our infrastructure is 
intentionally configured to be vulnerable to a wide range 
of attacks; hopefully, savvy computer users who dili- 
gently apply software updates would be far less vulnera- 
ble to infection. To be clear, while our analysis unequiv- 
ocally shows that millions of users are exposed to ma- 
licious content every day, without a wide-scale browser 
vulnerability study, the actual number of compromises 
remains unknown. Nonetheless, we believe the perva- 
sive nature of the results in this study elucidates the state 
of the malware problem today and, hopefully, serves to
educate users, web masters, and other researchers
about the security challenges ahead.

Lastly, we note that several outlets exist for taking
advantage of the results of our infrastructure. For
instance, the data that Google uses to flag search results
is freely available through the Safe Browsing API [2], as
well as via the Safe Browsing diagnostic page [3]. We
hope these services prove to be of benefit to the
community at large.


9 Related Work 


Virtual machines have been used as honeypots for
detecting unknown attacks by several researchers [4, 16,
17, 25, 26]. Although honeypots have traditionally been
used mostly for detecting attacks against servers, the
same principles also apply to client honeypots (e.g., an
instrumented browser running on a virtual machine). For
example, Moshchuk et al. used client-side techniques
to study spyware on the web (by crawling 18 million
URLs in May 2005 [17]). Their primary focus was not on
detecting drive-by downloads, but on finding links to
executables labeled spyware by an adware scanner.
Additionally, they sampled 45,000 URLs for drive-by
downloads and showed a decrease over time. However, the
fundamental limitation of analyzing the malicious nature
of URLs discovered by “spidering” is that a crawl can
only follow content links, whereas the malicious nature
of a page is often determined by the web hosting
infrastructure. As such, while the study of Moshchuk et al.
provides valuable insights, a truly comprehensive
analysis of this problem requires a much more in-depth crawl
of the web. As we were able to analyze many billions of
URLs, we believe our findings are more representative
of the state of the overall problem.

More closely related is the work of Provos et al. [20] 
and Seifert et al. [24] which raised awareness of the 
threat posed by drive-by downloads. These works are
aimed at explaining how different web page components
are used to exploit web browsers, and provide an
overview of the different exploitation techniques in use
today. Wang et al. proposed an approach for detecting
exploits against Windows XP when visiting webpages in 
Internet Explorer [26]. Their approach is capable of de- 
tecting zero-day exploits against Windows and can de- 
termine which vulnerability is being exploited by expos- 
ing Windows systems with different patch levels to dan- 
gerous URLs. Their results, on roughly 17,000 URLs, 
showed that about 200 of these were dangerous to users. 

This paper differs from all of these works in that it of- 
fers a far more comprehensive analysis of the different 
aspects of the problem posed by web-based malware, in- 
cluding an examination of its prevalence, the structure of 
the distribution networks, and the major driving forces. 

Lastly, malware detection via dynamic tainting analy- 
sis may provide deeper insight into the mechanisms by 
which malware installs itself and how it operates [10, 15, 
27]. In this work, we are more interested in structural 
properties of the distribution sites themselves, and how 
malware behaves once it has been implanted. Therefore, 
we do not employ tainting because of its computational 
expense, and instead, simply collect changes made by the 
malware that do not require having the ability to trace the 
information flow in detail. 


10 Conclusion 


The fact that malicious URLs that initiate drive-by down- 
loads are spread far and wide raises concerns regarding 
the safety of browsing the Web. However, to date, little 
is known about the specifics of this increasingly common 
malware distribution technique. In this work, we attempt 
to fill in the gaps about this growing phenomenon by pro- 
viding a comprehensive look at the problem from several 
perspectives. Our study uses a large scale data collection 
infrastructure that continuously detects and monitors the 
behavior of websites that perpetrate drive-by downloads. 
Our in-depth analysis of over 66 million URLs (spanning
a 10 month period) reveals that the scope of the problem
is significant. For instance, we find that 1.3% of the
incoming search queries to Google’s search engine return
at least one link to a malicious site.

Moreover, our analysis reveals several forms of rela- 
tions between some distribution sites and networks. A 
more troubling concern is the extent to which users may 
be lured into the malware distribution networks by con- 
tent served through online Ads. For the most part, the 
syndication relations that implicitly exist in advertising 
networks are being abused to deliver malware through 
Ads. Lastly, we show that merely avoiding the dark 
corners of the Internet does not limit exposure to mal- 
ware. Unfortunately, we also find that even state-of-the- 
art anti-virus engines are lacking in their ability to protect 
against drive-by downloads. While this is to be expected, 
it does call for more elaborate defense mechanisms to 
curtail this rapidly increasing threat.


Notes


1 Some compromised web servers also trigger dialog windows
asking users to manually download and run malware. However, this
analysis considers only malware installs that require no user
interaction.

2 This mapping is readily available at Google.

3 We consider a version as outdated if it is older than the latest
corresponding version released by January, 2007 (the start date for
our data collection).

4 We restrict our analysis to Windows executables identified by
searching for PE headers in each payload.


Abstract


Many web sites embed third-party content in frames, re- 
lying on the browser’s security policy to protect them 
from malicious content. Frames, however, are often in- 
sufficient isolation primitives because most browsers let 
framed content manipulate other frames through naviga- 
tion. We evaluate existing frame navigation policies and 
advocate a stricter policy, which we deploy in the open- 
source browsers. In addition to preventing undesirable 
interactions, the browser’s strict isolation policy also hin- 
ders communication between cooperating frames. We 
analyze two techniques for inter-frame communication. 
The first method, fragment identifier messaging, pro- 
vides confidentiality without authentication, which we 
repair using concepts from a well-known network pro- 
tocol. The second method, postMessage, provides 
authentication, but we discover an attack that breaches 
confidentiality. We modify the postMessage API to 
provide confidentiality and see our modifications stan- 
dardized and adopted in browser implementations. 


1 Introduction 


Web sites contain content from sources of varying trust- 
worthiness. For example, many web sites contain third- 
party advertising supplied by advertisement networks or 
their sub-syndicates [6]. Other common aggregations 
of third-party content include Flickr albums [12], Face- 
book badges [9], and personalized home pages offered 
by the three major web portals [15, 40, 28]. More ad- 
vanced uses of third-party components include Yelp’s 
use of Google Maps [14] to display restaurant locations 
and the Windows Live Contacts gadget [27]. A web 
site combining content from multiple sources is called a 
mashup, with the party combining the content called the 
integrator and integrated content called a gadget. In sim- 
ple mashups, the integrator does not intend to communi- 
cate with the gadgets and requires only that the browser 


USENIX Association 


Collin Jackson 
Stanford University 
collinj@cs.stanford.edu


John C. Mitchell 
Stanford University 
mitchell@cs.stanford.edu


isolate frames. In more complex mashups, the integra- 
tor does intend to communicate with the gadgets and re- 
quires secure inter-frame communication. 

In this paper, we study the contemporary web ver- 
sion of a recurring problem in computer systems: isolat- 
ing untrusted, or partially trusted, software components 
while providing secure inter-component communication. 
Whenever a site integrates third-party content, such as 
an advertisement, a map, or a photo album, the site runs 
the risk of incorporating malicious content. Without iso- 
lation, malicious content can compromise the confiden- 
tiality and integrity of the user’s session with the inte- 
grator. While the browser’s well-known “same-origin 
policy” [34] restricts script running in one frame from 
manipulating content in another frame, the browser uses 
a different policy to determine whether one frame is al- 
lowed to navigate (change the location of) another frame. 
Although restricting navigation is essential to providing 
isolation, navigation also enables one form of inter-frame 
communication used in mashup frameworks from lead- 
ing companies. Furthermore, we show that an attacker 
can use frame navigation to attack another inter-frame 
communication mechanism, postMessage. 


Isolation. We examine the browser frame as an iso- 
lation primitive. Because frames can contain untrusted 
content, the browser’s security policy restricts frame in- 
teractions. Many browsers, however, insufficiently re- 
strict the ability of one frame to navigate another frame 
to a new location. These overly permissive frame nav- 
igation policies lead to a variety of attacks, which we 
demonstrate against the Google AdSense login page and 
the iGoogle gadget aggregator. To prevent these attacks, 
we propose tightening the browser’s frame navigation 
policy while maintaining compatibility with existing web 
content. We have collaborated with browser vendors to 
deploy this policy in Firefox 3 and Safari 3.1. As the 
policy is already implemented in Internet Explorer 7, the 
policy is now deployed in the three most-used browsers. 
Table 1: Security properties of frame communication channels


Communication. With strong isolation, frames are 
limited in their interactions, raising the issue of how iso- 
lated frames can cooperate as part of a mashup. We 
analyze two techniques for inter-frame communication: 
fragment identifier messaging and postMessage. The 
results of our analysis are summarized in Table 1. 


• Fragment identifier messaging uses characteristics
of frame navigation to send messages between 
frames. As it was not designed for communica- 
tion, the channel has less-than-desirable security 
properties: messages are confidential but senders 
are not authenticated. To understand these prop- 
erties, we draw an analogy between this commu- 
nication channel and a network channel in which 
senders encrypt their messages to their recipi- 
ent’s public key. For concreteness, we examine 
the Microsoft.Live.Channels library [27],
which uses fragment identifier messaging to let 
the Windows Live Contacts gadget communicate 
with its integrator. The protocol used by Win- 
dows Live is analogous to the Needham-Schroeder 
public-key protocol [29]. We discover an attack 
on this protocol, related to Lowe’s anomaly in the 
Needham-Schroeder protocol [23], in which a mali- 
cious gadget can impersonate the integrator to the 
Contacts gadget. We suggested a solution based 
on Lowe’s improvement to the Needham-Schroeder 
protocol [23], and Microsoft implemented and de- 
ployed our suggestion within days. 


• postMessage is a new browser API designed for
inter-frame communication [19]. postMessage 
is implemented in Opera, Internet Explorer 8, Fire- 
fox 3, and Safari. Although postMessage has 
been deployed since 2005, we demonstrate an attack 
on the channel’s confidentiality using frame navi- 
gation. In light of this attack, the postMessage 
channel provides authentication but lacks confiden- 
tiality, analogous to a channel in which senders 
cryptographically sign their messages. To se- 
cure the channel, we propose a change to the 
postMessage API. We implemented our change 
in patches for Safari and Firefox. Our proposal has 
been adopted by the HTML 5 working group, Inter- 
net Explorer 8, Firefox 3, and Safari. 
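The confidentiality attack on postMessage and its repair can be illustrated with a toy model. This is a sketch under stated assumptions, not browser code: ModelFrame and its fields are hypothetical, the example origins are made up, and the sketch assumes that the API change referred to above corresponds to the targetOrigin argument of the standardized postMessage(message, targetOrigin) call.

```javascript
// Toy model of a frame that receives messages; not a browser API.
class ModelFrame {
  constructor(origin) {
    this.origin = origin; // origin of the document currently in the frame
    this.inbox = [];
  }
  // Navigation replaces the frame's document (and thus its origin).
  navigate(newOrigin) {
    this.origin = newOrigin;
    this.inbox = [];
  }
  // The "browser" attaches senderOrigin, so receivers can authenticate the
  // sender. Delivery is skipped when targetOrigin names a different origin.
  postMessage(message, senderOrigin, targetOrigin = "*") {
    if (targetOrigin !== "*" && targetOrigin !== this.origin) return false;
    this.inbox.push({ data: message, origin: senderOrigin });
    return true;
  }
}

const INTEGRATOR = "https://integrator.example";
const GADGET = "https://gadget.example";
const ATTACKER = "https://attacker.com";

// The attacker navigates the gadget's frame before the integrator sends.
const frame = new ModelFrame(GADGET);
frame.navigate(ATTACKER);

// Without a target origin, the secret reaches the attacker's document.
const leaked = frame.postMessage("secret", INTEGRATOR);
// With a target origin, the message is not delivered to the wrong origin.
const delivered = frame.postMessage("secret", INTEGRATOR, GADGET);

console.log(leaked, delivered, frame.inbox.length);
```

In this model the receiver always learns the sender's origin (authentication), while the sender learns nothing about the receiver unless it names a target origin (confidentiality).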
Organization. The remainder of the paper is organized
as follows. Section 2 details the threat model for these at- 
tacks. Section 3 surveys existing frame navigation poli- 
cies and converges browsers on a secure policy. Sec- 
tion 4 analyzes two frame communication mechanisms, 
demonstrates attacks, and proposes defenses. Section 5 
describes related work. Section 6 concludes. 


2 Threat Model 


In this paper, we are concerned with securing in-browser 
interactions from malicious attackers. We assume an 
honest user employs a standard web browser to view con- 
tent from an honest web site. A malicious “web attacker” 
attempts to disrupt this interaction or steal sensitive infor- 
mation. Typically, a web attacker places malicious con- 
tent (e.g., JavaScript) in the user’s browser and modifies 
the state of the browser, interfering with the honest ses- 
sion. To study the browser’s security policy, which deter- 
mines the privileges of the attacker’s content, we define 
the web attacker threat model below. 


Web Attacker. A web attacker is a malicious princi- 
pal who owns one or more machines on the network. In 
order to study the security of browsers when rendering 
malicious content, we assume that the browser gets and 
renders content from the attacker’s web site. 


• Network Abilities. The web attacker has no special
network abilities. In particular, the web attacker
can send and receive network messages only from 
machines under his or her control, possibly acting 
as a client or server in network protocols of the at- 
tacker’s choice. Typically, the web attacker uses at 
least one machine as an HTTP server, which we 
refer to for simplicity as attacker.com. The 
web attacker can obtain SSL certificates for do- 
mains he or she owns; certificate authorities such 
as instantssl.com provide such certificates for
free. The web attacker’s network abilities are decid- 
edly weaker than the usual network attacker consid- 
ered in studies of network security because the web 
attacker can neither eavesdrop on messages sent to 
other recipients nor forge messages from other net- 
work locations. For example, a web attacker cannot 
act as a “man-in-the-middle.” 
• Interaction with Client. We assume the honest
user views attacker.com in at least one browser
window, thereby rendering the attacker’s content. 
We make this assumption because we believe that 
an honest user’s interaction with an honest site 
should be secure even if the user separately vis- 
its a malicious site in a different browser window. 
We assume the web attacker is constrained by the 
browser’s security policy and does not employ a 
browser exploit to circumvent the policy. The web 
attacker’s host privileges are decidedly weaker than 
an attacker who can execute arbitrary code on the
user’s machine with the user’s privileges. For exam- 
ple, a web attacker cannot install or run a system- 
wide key logger or botnet client. 


Attacks accessible to a web attacker have significant 
practical impact because the attacks can be mounted 
without any complex or unusual control of the network. 
In addition, web attacks can be carried out by a standard 
man-in-the-middle network attacker, provided the user 
visits a single HTTP site, because a man-in-the-middle 
can intercept HTTP requests and inject malicious content 
into the reply, simulating a reply from attacker.com. 

There are several techniques an attacker can use to 
drive traffic to attacker.com. For example, an at- 
tacker can place web advertisements, display popular 
content indexed by search engines, or send bulk e-mail to 
attract users. Typically, simply viewing an attacker’s ad- 
vertisement lets the attacker mount a web-based attack. 
In a previous study [20], we purchased over 50,000 im- 
pressions for $30. During each of these impressions, a 
user’s browser rendered our content, giving us the access 
required to mount a web attack. 

We believe that a normal, but careful, web user who 
reads news and conducts banking, investment, and re- 
tail transactions, cannot effectively monitor or restrict the 
provenience of all content rendered in his or her browser, 
especially in light of third-party advertisements. In other 
words, we believe that the web attacker threat model is an 
accurate representation of normal web behavior, appro- 
priate for security analysis of browser security, and not 
an assumption that users promiscuously visit all possible 
bad sites in order to tempt fate. 


Gadget Attacker. A gadget attacker is a web attacker 
with one additional ability: the integrator embeds a gad- 
get of the attacker’s choice. This assumption lets us ac- 
curately evaluate mashup isolation and communication 
protocols because the purpose of these protocols is to let 
an integrator embed untrusted gadgets safely. In practice, 
a gadget attacker can either wait for the user to visit the 
integrator or can redirect the user to the integrator’s web 
site from attacker.com. 
Out-of-Scope Threats. Although phishing [11, 7] can
be described informally as a “web attack,” the web
attacker defined above does not attempt to fool the 
user by choosing a confusing domain name (such as 
bankofthevvest.com) or using other social engi- 
neering. In particular, we do not assume that a user 
treats attacker.com as if it were a site other than 
attacker.com. The attacks presented in this paper 
are “pixel-perfect” in the sense that the browser provides 
the user no indication whatsoever that an attack is under- 
way. The attacks do not display deceptive images over 
the browser security indicators nor do they spoof the location
bar or the lock icon. In this paper, we do not
consider cross-site scripting attacks, in which an attacker 
exploits a bug in an honest principal’s web site to inject 
malicious content into another security origin. None of 
the attacks described in this paper rely on the attacker 
injecting content into another principal’s security origin. 
Instead, we focus on privileges the browser itself affords 
the attacker to interact with honest sites. 


3 Frame Isolation 


Netscape Navigator 2.0 introduced the HTML <frame> 
element, which allows web authors to delegate a portion 
of their document’s screen real estate to another doc- 
ument. These frames can be navigated independently 
of the rest of the main content frame and can, them- 
selves, contain frames, further delegating screen real es- 
tate and creating a frame hierarchy. Most modern frames 
are embedded using the more-flexible <iframe> ele- 
ment, introduced in Internet Explorer 3.0. In this paper, 
we use the term frame to refer to both <frame> and 
<iframe> elements. The main, or top-level, frame of 
a browser window displays its location in the browser’s 
location bar. Subframes are often indistinguishable from 
other parts of a page, and the browser does not display 
their location in its user interface. Browsers decorate a 
window with a lock icon only if every frame contained 
in the window was retrieved over HTTPS but do not require
the frames to be served from the same host. For example,
if https://bank.com/ embeds a frame from
https://attacker.com/, the browser will decorate
the window with a lock icon.
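The lock-icon rule can be sketched as a predicate over the URLs of all frames in a window. This is a minimal illustration, not browser code; the function name showsLockIcon and the sample URLs are assumptions made for the example.

```javascript
// Hypothetical helper: decides whether a window would be decorated with a
// lock icon under the rule described above (every frame retrieved over HTTPS).
function showsLockIcon(frameUrls) {
  return frameUrls.every((u) => new URL(u).protocol === "https:");
}

// Frames from different hosts still yield a lock icon if all use HTTPS.
console.log(showsLockIcon(["https://bank.com/", "https://attacker.com/"]));
// A single non-HTTPS frame removes the lock icon.
console.log(showsLockIcon(["https://bank.com/", "http://ads.example/"]));
```

Note that the predicate inspects only the scheme of each frame, never its host, which is why the lock icon says nothing about who supplied a subframe.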


Organization. Section 3.1 reviews browser security 
policies. Section 3.2 describes cross-window frame 
navigation attacks and defenses. Section 3.3 details 
same-window attacks that are not impeded by the cross- 
window defenses. Section 3.4 analyzes stricter naviga- 
tion policies and advocates the “descendant policy.” Sec- 
tion 3.5 documents our implementation and deployment 
of the descendant policy in major browsers. 


17th USENIX Security Symposium


3.1 Background 


Scripting Policy. Most web security is focused on the 
browser’s scripting policy, which answers the question 
“when is script in one frame permitted to manipulate the 
contents of another frame?” The scripting policy is the 
most important browser security policy because the abil- 
ity to script another frame is the ability to control its 
appearance and behavior completely. For example, if 
otherWindow is another window’s frame, 


var stolenPassword = 
otherWindow.document.forms[0]. 
password.value; 


attempts to steal the user’s password in the other win- 
dow. Modern web browsers permit one frame to read 
and write all the DOM properties of another frame only 
when their content was retrieved from the same ori- 
gin, i.e. when the scheme, host, and port number of 
their locations match. If the content of otherWindow
was retrieved from a different origin, the browser’s se- 
curity policy will prevent this script from accessing 
otherWindow.document. 
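The origin comparison can be sketched with the standard URL parser. The helper name sameOrigin is hypothetical, and the sketch covers only the scheme, host, and port comparison described above, not the other details of a real browser's policy.

```javascript
// Hypothetical helper modeling the same-origin comparison: two locations
// share an origin when scheme, host, and port all match.
function sameOrigin(a, b) {
  const x = new URL(a);
  const y = new URL(b);
  // The URL parser reports an empty port when the scheme's default port is
  // used, so "https://bank.com:443/" and "https://bank.com/" compare equal.
  return x.protocol === y.protocol &&
         x.hostname === y.hostname &&
         x.port === y.port;
}

console.log(sameOrigin("https://bank.com/login", "https://bank.com:443/account"));
console.log(sameOrigin("https://bank.com/", "http://bank.com/"));
console.log(sameOrigin("https://bank.com/", "https://attacker.com/"));
```

Under this check, only the first pair would be permitted to script each other.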


Navigation Policy. Every browser must answer the 
question “when is one frame permitted to navigate an- 
other frame?” Prior to 1999, all web browsers imple- 
mented a permissive policy: 





Permissive Policy 
A frame can navigate any other frame. 











For example, if otherWindow includes a frame,


otherWindow.frames[0].location = 
"https://attacker.com/"; 


navigates the frame to https://attacker.com/.
This has the effect of replacing the frame’s docu- 
ment with content retrieved from that URL. Under 
the permissive policy, this navigation succeeds even if 
otherWindow contains content from a different secu- 
rity origin. There are a number of other idioms for navi- 
gating frames, including 


window.open("https://attacker.com/",
"frameName");


which requests that the browser search for a frame named 
frameName and navigate the frame to the specified 
URL. Frame names exist in a global name space and are 
not restricted to a single security origin. 
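The effect of a global frame name space under the permissive policy can be imitated with a toy model. Everything here is hypothetical (the windows array, the frame names, and openByName); it only mimics the name lookup performed by window.open(url, name) and is not browser code.

```javascript
// Toy model: each window holds named frames from arbitrary origins.
const windows = [
  new Map([["passwordFrame", { location: "https://bank.com/password" }]]),
  new Map([["ad", { location: "https://attacker.com/ad" }]]),
];

// Imitates window.open(url, name) under the permissive policy: the name is
// looked up across all windows, with no check on the caller's origin.
function openByName(url, name) {
  for (const frames of windows) {
    if (frames.has(name)) {
      frames.get(name).location = url;
      return true;
    }
  }
  return false; // no frame with that name exists in this model
}

// Script running in the advertisement's window navigates the bank's frame.
const navigated = openByName("https://attacker.com/", "passwordFrame");
console.log(navigated, windows[0].get("passwordFrame").location);
```

Because the lookup ignores both the window and the origin of the caller, knowing (or guessing) a frame's name is enough to navigate it in this model.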
Top-level Frames. Top-level frames are often exempt
from the restrictions imposed by the browser’s frame 
navigation policy. Top-level frames are less vulnerable 
to frame navigation attacks because the browser displays 
their location in the location bar. Internet Explorer and 
Safari do not restrict the navigation of top-level frames 
at all. Firefox restricts the navigation of top-level frames 
based on their openers, but this restriction can be circum- 
vented [2]. Opera implements a number of restrictions 
on the navigation of top-level frames based on the cur- 
rent location of the frame. 


3.2 Cross-Window Attacks 


In 1999, Georgi Guninski discovered that the permis- 
sive frame navigation policy admits serious attacks [16]. 
Guninski discovered that, at the time, the password 
field on the CitiBank login page was contained within 
a frame. Because the permissive frame navigation policy 
lets any frame navigate any other frame, a web attacker 
can navigate the password frame on CitiBank’s page 
to https://attacker.com/, replacing the frame 
with identical-looking content that sends the user’s pass- 
word to attacker.com. In the modern web, this 
cross-window attack might proceed as follows: 


1. The user reads a popular blog that displays a Flash 
advertisement provided by attacker.com. 


2. The user opens a new window to bank.com, 
which displays its password field in a frame. 


3. The malicious advertisement navigates the pass- 
word frame to https://attacker.com/. The 
location bar still reads bank.com and the lock icon
is not removed. 


4. The user enters his or her password, which is then 
submitted to attacker.com. 


Of the browsers in heavy use today, Internet Explorer 6 
and Safari 3 both implement the permissive policy. In- 
ternet Explorer 7 and Firefox 2 implement stricter poli- 
cies (described in subsequent sections). However, Flash 
Player can be used to circumvent the stricter navigation 
policy of Internet Explorer 7, effectively reducing the 
policy to “permissive.” Many web sites are vulnerable to 
this attack, including Google AdSense, which displays 
its password field inside a frame; see Figure 1. 


Window Policy. In response to Guninski’s report, 
Mozilla implemented a stricter policy in 2001: 





Window Policy 
A frame can navigate only frames in its window. 
Figure 1: Cross-Window Attack: The attacker controls the password field because it is contained within a frame.


This policy prevents the cross-window attack because the 
web attacker does not control a frame in the same win- 
dow as the CitiBank or the Google AdSense login page. 
Without a foothold in the window, the attacker cannot 
navigate the login frame to attacker.com. 


3.3 Same-Window Attacks 


The window frame navigation policy is neither univer- 
sally deployed nor sufficiently strict to protect users on 
the modern web because mashups violate its implicit se- 
curity assumption that an honest principal will not embed 
a frame to a dishonest principal. 


Mashups. A mashup combines content from multiple
sources to create a single user experience. The party 
combining the content is called the integrator and the 
integrated content is called a gadget. 


• Aggregators. Gadget aggregators, such as
iGoogle [15], My Yahoo [40], and Win- 
dows Live [28], are one form of mashup. These 
sites let users customize their experience by se- 
lecting gadgets (such as stock tickers, weather 
predictions, news feeds, etc.) to include on their
home page. Third parties are encouraged to develop 
gadgets for the aggregator. These mashups embed 
the selected gadgets in a frame and rely on the 
browser’s frame isolation to protect users from 
malicious gadgets. 


• Advertisements. Web advertising is a simple form
of mashup, combining first-party content, such as 
news articles or sports statistics, with third-party ad- 
vertisements. Typically, the publisher (the integra- 
tor) delegates a portion of its screen real estate to an 
advertisement network, such as Google, Yahoo, or 
Microsoft, in exchange for money. Most advertisements,
including Google AdWords, are contained in
frames, both to prevent the advertisers (who provide 
the gadgets) from interfering with the publisher’s 
site and to prevent the publisher from using
JavaScript to click on the advertisements. 


We refer to aggregators and advertisements as simple 
mashups because these mashups do not involve commu- 
nication between the gadgets and the integrator. Simple 
mashups rely on the browser to provide isolation but do 
not require inter-frame communication. 


Gadget Hijacking Attacks. Mashups invalidate an 
implicit assumption of the window policy, that an hon- 
est principal will not embed a frame to a dishonest prin- 
cipal. A gadget attacker, however, does control a frame 
embedded by the honest integrator, giving the attacker 
the foothold required to mount a gadget hijacking attack [22].
In such an attack, a malicious gadget navigates
a target gadget to attacker.com and impersonates
the gadget to the user.


• Aggregator Vulnerabilities. iGoogle is vulnerable
to gadget hijacking in browsers, such as Firefox 2, 
that implement the permissive or window policies; 
see Figure 2. Consider, for example, one popu- 
lar iGoogle gadget that lets users access their Hot- 
mail inbox. (This gadget is neither provided nor 
endorsed by Microsoft.) If the user is not logged 
into Hotmail, the gadget requests the user’s Hotmail 
password. A malicious gadget can replace the Hot- 
mail gadget with content that asks the user for his or 
her Hotmail password. As in the cross-window at- 
tack, the user is unable to distinguish the malicious 
password field from the honest password field. 
Figure 2: Gadget Hijacking Attack. Under the window policy, the attacker gadget can navigate the other gadgets.


• Advertisement Vulnerabilities. Although text advertisements
often do not contain active content
(e.g., JavaScript), other forms of advertising, such 
as Flash advertisements, do contain active content. 
An attacker who provides such an advertisement 
can steal advertising impressions allotted to other 
advertisers via gadget hijacking. A malicious ad- 
vertisement can traverse the page’s frame hierar- 
chy and navigate frames containing other advertise- 
ments to attacker.com, replacing the existing 
content with the attacker’s advertisement. 


3.4 Stricter Policies 


Although browser vendors do not document their navigation
policies, we were able to reverse engineer the navigation
policies of existing browsers, and we confirmed
our understanding with the browsers’ developers. The 
existing policies are shown in Table 2. In addition to 
the permissive and window policies described above, we 
discovered two other frame navigation policies: 





Descendant Policy 
A frame can navigate only its descendants. 








Child Policy 
A frame can navigate only its direct children. 











The Internet Explorer 6 team wanted to enable the child 
policy by default, but shipped the permissive policy be- 
cause the child policy was incompatible with a large 
number of web sites. The Internet Explorer 7 team de- 
signed the descendant policy to balance the security re- 
quirement to defeat the cross-window attack with the 
compatibility requirement to support existing sites [33]. 
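The four navigation policies can be compared side by side on a small frame tree. The following is a minimal sketch rather than any browser's implementation: the Frame class, its fields, and the example origins are assumptions made for illustration, and the exemptions that browsers grant to top-level frames are not modeled.

```javascript
// Minimal frame-tree model; `origin`, `parent`, and `children` are
// illustrative fields, not browser APIs.
class Frame {
  constructor(origin, parent = null) {
    this.origin = origin;
    this.parent = parent;
    this.children = [];
    if (parent) parent.children.push(this);
  }
  // The top-level frame identifies the window this frame lives in.
  top() {
    let f = this;
    while (f.parent) f = f.parent;
    return f;
  }
  isDescendantOf(other) {
    for (let f = this.parent; f; f = f.parent) {
      if (f === other) return true;
    }
    return false;
  }
}

// One predicate per policy: may `active` navigate `target`?
const policies = {
  permissive: (active, target) => true,
  window: (active, target) => active.top() === target.top(),
  descendant: (active, target) => target.isDescendantOf(active),
  child: (active, target) => target.parent === active,
};

// A mashup: an integrator embedding an honest gadget and an attacker gadget.
const integrator = new Frame("https://integrator.example");
const honestGadget = new Frame("https://gadget.example", integrator);
const attackerGadget = new Frame("https://attacker.com", integrator);
const nestedFrame = new Frame("https://gadget.example", honestGadget);
// A separate window controlled by the attacker.
const attackerWindow = new Frame("https://attacker.com");

for (const [name, allowed] of Object.entries(policies)) {
  console.log(name,
    "cross-window:", allowed(attackerWindow, honestGadget),
    "gadget hijack:", allowed(attackerGadget, honestGadget),
    "integrator -> grandchild:", allowed(integrator, nestedFrame));
}
```

In this model the window policy stops the cross-window attack but not gadget hijacking, while the descendant and child policies stop both; the two differ only in whether the integrator may navigate a frame nested inside one of its gadgets.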
Pixel Delegation. The descendant policy provides the
most attractive trade-off between security and compat- 
ibility because it is the least restrictive policy that re- 
spects pixel delegation. When one frame embeds another 
frame, the parent frame delegates a region of the screen 
to the child frame. The browser prevents the child frame 
from drawing outside of its bounding box but does al- 
low the parent frame to draw over the child using the 
position: absolute style. The descendant policy 
permits a frame to navigate a target frame precisely when 
the frame could overwrite the screen real estate of the tar- 
get frame. Although the child policy is stricter than the 
descendant policy, the additional strictness does not pre- 
vent many additional attacks because a frame can sim- 
ulate the visual effects of navigating a grandchild frame 
by drawing over the region of the screen occupied by 
the grandchild frame. The child policy’s added strictness 
does, however, reduce the policy’s compatibility with ex- 
isting sites, discouraging browser vendors from deploy- 
ing the child policy. 


Origin Propagation. A strict interpretation of the de- 
scendant policy prevents a frame from navigating its sib- 
lings, even if the frame is from the same security origin 
as its parent. In this situation, the frame can navigate its 
sibling indirectly by injecting script into its parent, which 
can then navigate the sibling because the sibling is a de- 
scendant of the parent frame. In general, browsers should 
decide whether or not to permit a navigation based on the 
active frame’s security origin. Browsers should let an ac- 
tive frame navigate a target frame if there exists a frame 
in the same security origin as the active frame that has 
the target frame as a descendant. By recognizing this origin
propagation, browsers can achieve a better trade-off
between security and compatibility. These additional navigations
do not sacrifice security because an attacker can
perform the navigations indirectly, but allowing them is 
more convenient for honest web developers. 
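Origin propagation can be expressed as a small change to the descendant check. This is a sketch under assumptions: makeFrame, isDescendantOf, and canNavigate are hypothetical names, frames are plain records, and the list of all frames is passed in explicitly.

```javascript
// Illustrative frame record; the fields are assumptions for this sketch.
function makeFrame(origin, parent = null) {
  return { origin, parent };
}

function isDescendantOf(frame, ancestor) {
  for (let f = frame.parent; f; f = f.parent) {
    if (f === ancestor) return true;
  }
  return false;
}

// Descendant policy with origin propagation: `active` may navigate `target`
// if some frame in `allFrames` shares the active frame's origin and has
// `target` as a descendant.
function canNavigate(active, target, allFrames) {
  return allFrames.some(
    (f) => f.origin === active.origin && isDescendantOf(target, f)
  );
}

const parent = makeFrame("https://integrator.example");
const sameOriginChild = makeFrame("https://integrator.example", parent);
const sibling = makeFrame("https://gadget.example", parent);
const attacker = makeFrame("https://attacker.com", parent);
const frames = [parent, sameOriginChild, sibling, attacker];

// Allowed: the child shares an origin with the parent, whose descendants
// include the sibling.
console.log(canNavigate(sameOriginChild, sibling, frames));
// Denied: no frame in the attacker's origin has the sibling as a descendant.
console.log(canNavigate(attacker, sibling, frames));
```

The check is made on origins rather than on individual frames, which matches the observation that a frame could otherwise achieve the same navigation by scripting its same-origin parent.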


3.5 Deployment 


We collaborated with the HTML 5 working group [18] 
and browser vendors to deploy the descendant policy in 
several browsers: 


• Safari. We implemented the descendant policy as
a patch for Safari. Apple accepted our patch and 
deployed the descendant policy to Mac OS X and 
Windows Safari users as a security update [30]. Ap- 
ple also deployed our patch to all iPhone and iPod 
touch users. 


• Firefox. We implemented the descendant policy as
a patch for Firefox. Before accepting our patch, 
Mozilla requested tests for all their previous frame 
navigation regressions. We provided them with ap- 
proximately 1000 lines of regression tests for their 
automatic test harness, covering the frame naviga- 
tion security vulnerabilities from the past ten years. 
Mozilla accepted our patch and deployed the de- 
scendant policy in Firefox 3 [1]. 


• Flash. We reported to Adobe that Flash Player bypasses
the descendant policy in Internet Explorer 7.
Adobe agreed to ship a patch to all Internet Explorer 
users in their next security update. 


• Opera. We notified Opera Software about inconsistencies
in Opera’s child policy that can be used in
gadget hijacking attacks. They plan to fix these vul- 
nerabilities in the upcoming release of Opera 9.5, 
and are evaluating the compatibility benefits of 
adopting the descendant policy [35]. 


4 Frame Communication 


Over the past few years, web developers have built so- 
phisticated mashups that, unlike simple aggregators and 
advertisements, are comprised of gadgets that commu- 
nicate with each other and with their integrator. Yelp, 
which integrates the Google Maps gadget, motivates the 
need for secure inter-frame communication by illustrat- 
ing how communicating gadgets are used in real de- 
ployments. Sections 4.1 and 4.2 analyze and improve 
fragment-identifier messaging and postMessage. 


Google Maps. One popular gadget is the Google Maps 
API [14]. Google provides two mechanisms for integrat- 
ing Google Maps: 
• Frame. In the frame version of the gadget, the integrator
embeds a frame to maps.google.com,
which Google fills with a map centered at the specified
location. The user can interact with the map, but
the integrator is oblivious to this interaction and 
cannot interact with the map directly. 


• Script. In the script version of the gadget, the
integrator embeds a <script> tag that executes
JavaScript from maps.google.com. This script
creates a rich JavaScript API the integrator can use 
to interact with the map, but the script runs with all 
of the integrator’s privileges. 


Yelp. Yelp is a popular review web site that uses the 
Google Maps gadget to display the locations of restau- 
rants and other businesses it reviews. Yelp requires a 
high degree of interactivity with the Maps gadget be- 
cause it places markers on the map for each restaurant 
and displays the restaurant’s review when the user clicks 
on the marker. In order to deliver these advanced features,
Yelp must use the script version of the Maps gadget.
This design requires Yelp to trust Google Maps completely
because Google’s script runs with Yelp’s privileges
in the user’s browser, granting Google the ability
to manipulate Yelp’s reviews and steal Yelp’s customers’
information. Although Google might be trustworthy,
the script approach does not scale beyond highly
respected gadget providers. Secure inter-frame commu- 
nication provides the best of both alternatives: Yelp (and 
similar sites) can realize the interactivity of the script ver- 
sion of Google Maps gadget while maintaining the secu- 
rity of the frame version of the gadget. 


4.1 The Fragment Identifier Channel 


Although the browser’s scripting policy isolates frames 
from different security origins, clever mashup designers 
have discovered an unintended channel between frames: 
the fragment identifier channel [3, 36]. This channel is 
regulated by the browser’s less-restrictive frame naviga- 
tion policy. This “found” technology lets mashup devel- 
opers place each gadget in a separate frame and rely on 
the browser’s security policy to prevent malicious gad- 
gets from attacking the integrator and honest gadgets. 


Mechanism. Normally, when a frame is navigated to 
a new URL, the browser retrieves the URL from the 
network and replaces the frame’s document with the 
retrieved content. However, if the new URL differs
from the old URL only in the fragment (the portion
after the #), then the browser does not reload
the frame. If frames[0] is currently located at 
http://example.com/doc, 
Table 2: Frame navigation policies deployed in existing browsers.


frames[0].location = 
"http://example.com/doc#message"; 


changes the frame’s location without reloading the frame 
or destroying its JavaScript context. The frame can ob- 
serve the value of the fragment by periodically polling 
window.location.hash to see if the fragment 
identifier has changed. This technique can be used to 
send short string messages entirely within the browser, 
avoiding network latency. However, the communication 
channel is somewhat unreliable because, if two naviga- 
tions occur between polls, the first message will be lost. 
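Both the fragment-only navigation rule and the message loss caused by polling can be reproduced in a small simulation. This is a sketch under assumptions: differsOnlyInFragment and PollingReceiver are hypothetical names, and the receiver is a plain object that stands in for a frame polling window.location.hash.

```javascript
// Strips the fragment from a URL after normalizing it with the URL parser.
function withoutFragment(url) {
  return new URL(url).href.split("#")[0];
}

// True when navigating from oldUrl to newUrl changes only the fragment,
// the case in which the frame is not reloaded.
function differsOnlyInFragment(oldUrl, newUrl) {
  return new URL(oldUrl).href !== new URL(newUrl).href &&
         withoutFragment(oldUrl) === withoutFragment(newUrl);
}

// Simulated receiver frame: the sender overwrites `hash`, and the receiver
// sees only whatever value is present when it polls.
class PollingReceiver {
  constructor() {
    this.hash = "";
    this.lastSeen = "";
    this.received = [];
  }
  navigateFragment(message) { // sender side: set location to ...#message
    this.hash = "#" + message;
  }
  poll() {                    // receiver side: check for a changed hash
    if (this.hash !== this.lastSeen) {
      this.lastSeen = this.hash;
      this.received.push(this.hash.slice(1));
    }
  }
}

console.log(differsOnlyInFragment(
  "http://example.com/doc", "http://example.com/doc#message"));

const rx = new PollingReceiver();
rx.navigateFragment("first");
rx.navigateFragment("second"); // arrives before the next poll
rx.poll();
console.log(rx.received);      // only "second" is observed
```

A protocol built on this channel therefore needs its own acknowledgement or sequencing if it cannot tolerate dropped messages.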


Security Properties. Because it was “found” and not 
designed, the fragment identifier channel has less-than- 
ideal security properties. The browser’s scripting policy 
prevents security origins other than the one preceding the 
# from eavesdropping on messages because they are un- 
able to read the frame’s location (even though the nav- 
igation policy permits them to write to the frame’s lo- 
cation). Browsers also prevent arbitrary security origins 
from tampering with portions of messages. Other secu- 
rity origins can, however, overwrite the fragment iden- 
tifier in its entirety, leaving the recipient to guess the 
sender of each message. 

To understand these security properties, we develop 
an analogy with well-known properties of network chan- 
nels. We view the browser as guaranteeing that the frag- 
ment identifier channel has confidentiality: a message 
can be read only by its intended recipient. The fragment 
identifier channel fails to be a secure channel because it 
lacks authentication, the ability of the recipient to un- 
ambiguously determine the sender of a message. The 
channel also fails to be reliable because messages might 
not be delivered, and the attacker might be able to replay 
previous messages using the browser’s history API. 

The security properties of the fragment identifier chan- 
nel are analogous to a channel on an untrusted network 
secured by a public-key cryptosystem in which each 
message is encrypted with the public key of its intended 
recipient. In both cases, if Alice sends a message to Bob, 
no one except Bob learns the contents of the message 
(unless Bob forwards the message). In both settings, the 
channel does not provide a reliable procedure for deter- 
mining who sent a given message. There are two inter- 
esting differences between the fragment identifier chan- 
nel and the public-key channel: 


1. The public-key channel is susceptible to traffic analysis,
but an attacker cannot determine the length of
a message sent over the fragment identifier channel.
An attacker can extract timing information by frequently
polling the browser’s clock, but obtaining a
high-resolution timing signal significantly degrades
the browser’s performance.


2. The fragment identifier channel is constrained by 
the browser’s frame navigation policy. In principle, 
this could be used to construct protocols secure for 
the fragment identifier channel that are insecure for 
the public-key channel (by preventing the attacker 
from navigating the recipient), but in practice this 
restriction has not prevented us from constructing 
attacks on existing protocol implementations. 


Despite these differences, we find the network analogy 
useful in analyzing inter-frame communication. 


Windows Live Channels. Microsoft uses the frag- 
ment identifier channel in its Windows Live plat- 
form library to implement a higher-level channel API, 
Microsoft.Live.Channels [36]. The Windows 
Live Contacts gadget uses this API to communicate with 
its integrator. The integrator can instruct the gadget to 
add or remove contacts from the user’s contacts list, and 
the gadget can send the integrator details about the user’s 
contacts. Whenever the integrator asks the gadget to per- 
form a sensitive action, the gadget asks the user to con- 
firm the operation and displays the integrator’s host name 
to aid the user in making trust decisions. 

Microsoft.Live.Channels attempts to build a 
secure channel over the fragment identifier channel. By 
reverse engineering the implementation, we determined 
that it uses two sessions of the following protocol (one in 
each direction) to establish a secure channel: 


A → B : N_A, URI_A
B → A : N_A, N_B
A → B : N_B, Message_1


In this notation, A and B are frames, N_A and N_B are
fresh nonces (numbers chosen at random during each
run of the protocol), and URI_A is the location of A’s
frame. Under the network analogy described above,
this protocol is analogous to a variant of the classic
Needham-Schroeder key-establishment protocol [29].


... integrator.com, but in reality the request was made
by attacker.com.


The Needham-Schroeder protocol was designed to estab- 
lish a shared secret between two parties over an insecure 
channel. In the Needham-Schroeder protocol, each mes- 
sage is encrypted with the public key of its intended re- 
cipient. The Windows Live protocol does not employ en- 
cryption because the fragment identifier channel already 
provides the required confidentiality. 

The Needham-Schroeder protocol has a well-known 
anomaly, due to Lowe [23], which leads to an attack in 
the browser setting. In the Lowe scenario, an honest prin- 
cipal, Alice, initiates the protocol with a dishonest party, 
Eve. Eve then convinces honest Bob that she is Alice. In 
order to exploit the Lowe anomaly, an honest principal 
must be willing to initiate the protocol with a dishonest 
principal. This requirement is met in mashups because 
the integrator initiates the protocol with the gadget at- 
tacker’s gadget in order to establish a channel. The Lowe 
anomaly can be exploited to impersonate the integrator to 
the Windows Live Contacts gadget as follows:


Integrator → Attacker : N_I, URI_I
Attacker → Gadget : N_I, URI_I
Gadget → Integrator : N_I, N_G
Integrator → Attacker : N_G, Message_1


After these four messages, the attacker possesses N_I and
N_G and can impersonate the integrator to the gadget.
We have successfully implemented this attack against the
Windows Live Contacts gadget. The issue is easily observable
for the Contacts gadget because the gadget displays
the integrator’s host name to the user in its security
user interface; see Figure 3.


SMash and OpenAjax 1.1. A recent paper [22] from
IBM proposed another protocol for establishing a secure
channel over the fragment identifier channel. They describe
their protocol as follows:


The SMash library in the mashup application
creates the secret, an unguessable random
value. When creating the component, it includes
the secret in the fragment of the component
URL. When the component creates the
tunnel iframe it passes the secret in the same
manner.


The SMash developers have contributed their code to the
OpenAjax project, which plans to include their fragment
identifier protocol in version 1.1. The SMash protocol
can be understood as follows:


A → B : N, URI_A
B → A : N
A → B : N, Message_1


This protocol admits the following simple attack:


Attacker → Gadget : N, URI_I
Gadget → Integrator : N
Attacker → Gadget : N, Message


We have confirmed this attack by implementing it
against the SMash implementation. Additionally,
the attacker is able to conduct this attack covertly by
blocking the message from the gadget to the integrator
because the message waits for the load event to fire.


Secure Fragment Messaging. The fragment identifier
channel can be secured using a variant of the
Needham-Schroeder-Lowe protocol [23]. The main idea in Lowe’s
improvement of the Needham-Schroeder protocol is that
the responder must include his identity in the second
message of the protocol, letting the honest initiator determine
that an attack is in progress and abort the protocol.


A → B : N_A, URI_A
B → A : N_A, N_B, URI_B
A → B : N_B
...
A → B : N_A, N_B, Message_i
B → A : N_A, N_B, Message_j
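

The identity check that distinguishes this handshake from the earlier one can be sketched as follows. This is an illustrative model rather than any deployed library's code: createInitiator, respond, the example URIs, and the fixed nonce strings are all assumptions made for the sketch. A real page would draw nonces from a cryptographically secure source and would transport each message through a fragment navigation.

```javascript
// Sketch of the initiator's side of the three-message handshake.
// Nonces are supplied by the caller so the example stays deterministic.
function createInitiator(ownUri) {
  const pending = new Map(); // nonce N_A -> URI of the frame it was sent to

  return {
    // Message 1: A -> B : N_A, URI_A
    start(peerUri, nA) {
      pending.set(nA, peerUri);
      return { nA, uriA: ownUri };
    },
    // Message 2 arrives: B -> A : N_A, N_B, URI_B.
    // Returns message 3 (A -> B : N_B), or null to abort the run.
    onReply({ nA, nB, uriB }) {
      const expectedPeer = pending.get(nA);
      if (expectedPeer === undefined) return null; // unknown nonce
      if (expectedPeer !== uriB) return null;      // responder is not the intended peer
      pending.delete(nA);
      return { to: uriB, nB };
    },
  };
}

// Responder: answers message 1 with its own nonce and its own URI.
function respond(ownUri, { nA, uriA }, nB) {
  return { to: uriA, nA, nB, uriB: ownUri };
}

const INTEGRATOR = "https://integrator.example/";
const GADGET = "https://gadget.example/";
const ATTACKER = "https://attacker.example/";

// Honest run: the integrator starts the protocol with the gadget.
const honest = createInitiator(INTEGRATOR);
const firstMessage = honest.start(GADGET, "nonce-I-1");
const honestResult = honest.onReply(respond(GADGET, firstMessage, "nonce-G-1"));

// Lowe scenario: the integrator starts with the attacker, who forwards
// message 1 unchanged to the honest gadget. The gadget's reply names the
// gadget, not the attacker, so the integrator aborts.
const victim = createInitiator(INTEGRATOR);
const forwarded = victim.start(ATTACKER, "nonce-I-2");
const attackResult = victim.onReply(respond(GADGET, forwarded, "nonce-G-2"));

console.log(honestResult !== null, attackResult === null);
```

Without the URI_B comparison, the second run would return N_G to the attacker, which is the step that lets the attacker impersonate the integrator in the earlier protocol.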


We contacted Microsoft, IBM, and the OpenAJAX
Alliance about the vulnerabilities in their fragment
identifier messaging protocols and suggested
the above protocol improvement. Microsoft adopted
our suggestions and deployed a patched version of
Microsoft.Live.Channels and of the Windows
Live Contacts gadget. IBM adopted our suggestions and
revised their SMash paper. The OpenAJAX Alliance
adopted our suggestions and updated their codebase. All
three now use the above protocol to establish a secure
channel using fragment identifiers.


Figure 4: Recursive Mashup Attack


4.2 The postMessage Channel 


HTML 5 [19] specifies a new browser API for asyn- 
chronous communication between frames. Unlike the 
fragment identifier channel, the postMessage chan- 
nel was designed for cross-site communication. The 
postMessage API was originally implemented in 
Opera 8 and is now supported by Internet Explorer 8, 
Firefox 3 [37], and Safari [24]. 


Mechanism. To send a message to another frame, the 
sender calls the postMessage method: 


frames[0].postMessage("Hello world.");


The browser then generates a message event in the 
recipient’s frame that contains the message, the ori- 
gin (scheme, port, and domain) of the sender, and a 
JavaScript pointer to the frame that sent the message. 


Security Properties. The postMessage channel
guarantees authentication (messages accurately identify
their senders), but the channel lacks confidentiality. Thus,
postMessage has almost the “opposite” security properties
of the fragment identifier channel. Where the fragment
identifier channel has confidentiality without authentication,
the postMessage channel has authentication
without confidentiality. The security properties
of the postMessage channel are analogous to a channel
on an untrusted network secured by an existentially
unforgeable signature scheme. In both cases, if Alice
sends a message to Bob, Bob can determine unambiguously
that Alice sent the message. With postMessage,
the origin property accurately identifies the sender;
with cryptographic signatures, verifying the signature
on a message accurately identifies the signer of the
message. One difference between the channels is that
cryptographic signatures can be easily replayed, but the
postMessage channel is resistant to replay attacks. In
some cases, however, an attacker might be able to mount
a replay attack by reloading honest frames.


Attacks. Although postMessage is widely believed 
to provide a secure channel between frames, we show 
an attack on the confidentiality of the channel. A mes- 
sage sent with postMessage is directed at a frame, but 
if the attacker navigates that frame to attacker.com 
before the message event is generated, the attacker will 
receive the message instead of the intended recipient. 


• Recursive Mashup Attack. Suppose, for example,
that an integrator embeds a frame to a gadget
and then calls postMessage on that frame. The 
attacker can load the integrator inside a frame and 
carry out an attack without violating the descendant 
frame navigation policy. After the attacker loads the 
integrator inside a frame, the attacker navigates the 
gadget frame to attacker.com. Then, when the 
integrator calls postMessage on the “gadget’s” 
frame, the browser delivers the message to the at- 
tacker whose content now occupies the “gadget’s” 
frame; see Figure 4. The integrator can prevent this 
attack by “frame busting,” i.e., by refusing to render 
the mashup if top !== self, indicating that the 
integrator is contained in a frame. 


• Reply Attack. Another postMessage idiom is
also vulnerable to interception, even under the child
frame navigation policy:


window.onmessage = function(e) {
  if (e.origin == "https://b.com")
    e.source.postMessage(secret);
};


(b) Integrator’s reply is delivered to attacker


Figure 5: Reply Attack


The source attribute of the MessageEvent is 
a JavaScript reference to the frame that sent the 
message. It is tempting to conclude that the re- 
ply will be sent to https://b.com. How- 
ever, an attacker might be able to intercept the 
message. Suppose that the honest gadget calls 
top.postMessage("Hello"). The gadget
attacker can intercept the message by embedding 
the honest gadget in a frame, as depicted in Fig- 
ure 5. After the gadget posts its message to the 
integrator, the attacker navigates the honest gad- 
get to https://attacker.com. (This naviga- 
tion is permitted under both the child and descen- 
dant frame navigation policies.) When the integra- 
tor replies to the source of the message, the mes- 
sage will be delivered to the attacker instead of to 
the honest gadget. 





Securing postMessage. It might be feasible for sites 
to build a secure channel using postMessage as an 
underlying communication primitive, but we would pre- 
fer that postMessage provide a secure channel na- 
tively. In MashupOS [39], we proposed a new browser 
API, CommRequest, to send messages between ori- 
gins. When sending a message using CommRequest, 
the sender addresses the message to a principal: 


var req = new CommRequest();
req.open("INVOKE",
         "local:https://b.com//inc");
req.send("Hello");


Using this interface, CommRequest protects the confidentiality
of messages because the CommServer will
deliver messages only to the specified principal. Although
CommRequest provides adequate security, the
postMessage API is further along in the standardization
and deployment process. We therefore propose
extending the postMessage API to provide the additional
security benefits of CommRequest by including
a second parameter: the origin of the intended recipient.
If the sender specifies a target origin, the browser
will deliver the message to the targeted frame only if that
frame’s current security origin matches the argument.
The browser is free to deliver the message to any principal
if the sender specifies a target origin of *. Using
this improved API, a frame can reply to a message using
the following idiom:


window.onmessage = function(e) {
  if (e.origin == "https://b.com")
    e.source.postMessage(secret,
                         e.origin);
};


As shown in this example use, the API uses the same 
origin syntax for both sending and receiving messages. 
The scheme is included in the origin for those develop- 
ers who wish to defend against active network attackers 
by distinguishing between HTTP and HTTPS. We imple- 
mented this API change as a patch for Safari and a patch 
for Firefox. Our proposal was accepted by the HTML 5 
working group [17]. The new API is now included in 
Firefox 3 [38], Safari [32], and Internet Explorer 8 [25]. 
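

The effect of the target-origin argument on the Reply Attack can be illustrated with a small model that runs outside the browser. The sketch below is an assumption-laden simplification: ModelFrame and simulateReply are hypothetical names, a frame is reduced to an origin plus an inbox, and delivery is synchronous.

```javascript
// Hypothetical model of a frame: an origin that changes on navigation
// and an inbox of messages delivered to the current document.
class ModelFrame {
  constructor(origin) {
    this.origin = origin;
    this.inbox = [];
  }

  navigate(newOrigin) {
    this.origin = newOrigin;
    this.inbox = []; // the new document starts with no delivered messages
  }

  // Delivery rule from the text: deliver only if the frame's current
  // origin matches targetOrigin, or if targetOrigin is "*".
  postMessage(message, targetOrigin) {
    if (targetOrigin === "*" || targetOrigin === this.origin) {
      this.inbox.push(message);
      return true;
    }
    return false;
  }
}

function simulateReply(chooseTargetOrigin) {
  const senderFrame = new ModelFrame("https://b.com");
  // The honest gadget at https://b.com posts a message; the event records
  // the sender's origin and a reference to the sending frame.
  const event = { origin: senderFrame.origin, source: senderFrame };
  // Before the integrator replies, the attacker navigates that frame.
  senderFrame.navigate("https://attacker.com");
  // The integrator's handler, following the idiom in the text.
  if (event.origin === "https://b.com") {
    event.source.postMessage("secret", chooseTargetOrigin(event));
  }
  // Whatever is in the inbox now is readable by the attacker's document.
  return senderFrame.inbox;
}

const withoutTarget = simulateReply(() => "*");
const withTarget = simulateReply((e) => e.origin);
console.log(withoutTarget.length, withTarget.length);
```

In this model, replying with a wildcard target delivers the secret to the attacker's document, while replying with e.origin causes the message to be dropped because the frame's current origin no longer matches.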


5 Related Work 


Mitigations for Gadget Hijacking. SMash [22] mit- 
igates gadget hijacking (which the authors refer to as 
“frame phishing”) without modifying the browser by 
carefully monitoring the frame hierarchy and browser 
events for evidence of unexpected navigation. Neither 
the integrator nor the gadget can prevent these naviga- 
tions, but the mashup can alert the user and refuse to 
function if it detects an illicit navigation. This approach 
lets an attacker mount a denial-of-service attack against 
the mashup, but a web attacker can already mount a 
denial-of-service attack against the entire browser by is- 
suing a blocking XMLHttpRequest or entering an in- 
finite loop. 

Unfortunately, this approach can lead to false positives.
SMash waits 20 seconds for a gadget to load before
assuming that the gadget has been hijacked and warning 
the user. An attacker might be able to fool the user into 
entering sensitive information during this time interval. 
Using a shorter time interval might cause users with slow 
network connections to receive warnings even though no 
attack is in progress. We expect that the deployment of 
the descendant policy will obviate the need for server- 
enforced gadget hijacking mitigations. 


Safe Subsets of HTML and JavaScript. One way to 
sidestep the security issues of frame-based mashups is to 
avoid using frames entirely and render the gadgets to- 
gether with the integrator in a single document. This 
approach forgoes the protections of the browser’s se- 
curity policy because all the gadgets and the integra- 
tor share a single browser security context. To main- 
tain security, this approach requires gadgets to be writ- 
ten in a “safe subset” of HTML and JavaScript that pre- 
vents a malicious gadget from attacking the integrator or 
other gadgets. Analyzing the security and usability of 
these subsets is an active area of research. Several open- 
source [13, 4] and closed-source [31, 10] implementa- 
tions are available. FBML [10] is currently the most suc- 
cessful of these subsets and is used by millions of users 
as the foundation of the Facebook Platform. 

Writing programs in one of these safe subsets is often 
awkward because the language is highly constrained to 
avoid potentially dangerous features. To improve usabil- 
ity, the safe subsets are often accompanied by a com- 
piler that transforms untrusted HTML and JavaScript 
into the subset, possibly at the cost of performance. 
These safe subsets will become easier to use over time 
as these compilers become more sophisticated and more 
libraries become available, but with the deployment of 
postMessage and the descendant policy, we expect 
that frame-based mashup designs will continue to find 
wide use as well. 


Other Frame Isolation Proposals. There are several 
other proposals for frame isolation and communication: 


• Subspace. In Subspace [21], we used a multi-level
hierarchy of frames that coordinated their
document.domain property to communicate directly
in JavaScript. As with most frame-based
mashups, the descendant frame navigation policy is
required to prevent gadget hijacking.


• Module Tag. The proposed <module> tag [5]
is similar to an <iframe> tag, but the module
runs in an unprivileged security context, without a
principal, and the browser prevents the integrator
from overlaying content on top of the module. Unlike
postMessage, the communication primitive
used with the module tag is intentionally unauthenticated:
it does not identify the sender of a message.
It is unknown whether navigation can be used to intercept
messages as there are no implementations of
the <module> tag.


• Security=Restricted and Jail. Internet Explorer
supports a security attribute [26] of
frames that can be set to restricted. With 
security="restricted", the frame’s con- 
tent cannot run JavaScript. Similarly, the pro- 
posed <jail> tag [8] encloses untrusted content 
and prevents the sandboxed content from running 
JavaScript. However, eliminating JavaScript pre- 
vents gadgets from offering interactive experiences. 


• MashupOS. Our MashupOS proposal [39] includes
new primitives for isolating web content while al- 
lowing secure communication. Our improvements 
to postMessage and frame navigation policies 
allow web authors to obtain some of the benefits of 
MashupOS using existing web APIs. 


6 Conclusions 


Web browsers provide a platform for web applica- 
tions. These applications rely on the browser to isolate 
frames from different security origins and to provide se- 
cure inter-frame communication. To provide isolation, 
browsers implement a number of security policies, in- 
cluding a frame navigation policy. The original frame 
navigation policy, the permissive policy, admits a number 
of attacks. The modern frame navigation policy, the de- 
scendant policy, prevents these attacks by permitting one 
frame to navigate another only if the frame could draw 
over the other frame’s region of the screen. The descen- 
dant policy provides an attractive trade-off between secu- 
rity and compatibility, is deployed in the major browsers, 
and has been standardized in HTML 5. 

In existing browsers, frame navigation can be used as 
an inter-frame communication channel with a technique 
known as fragment identifier messaging. If used directly, 
the fragment identifier channel lacks authentication. To 
provide authentication, Microsoft.Live.Channels,
SMash, and OpenAjax 1.1 use messaging protocols. 
These protocols are vulnerable to attacks on authentica- 
tion but can be repaired in a manner analogous to Lowe’s 
variation of the Needham-Schroeder protocol [23]. 

The postMessage communication channel suffered 
the converse security vulnerability: using frame navi- 
gation, an attacker can breach the confidentiality of the 
channel. We propose providing confidentiality by ex- 
tending the postMessage API to let the sender specify 
an intended recipient. Our proposal was adopted by the
HTML 5 working group, Internet Explorer 8, Firefox 3, 
and Safari. 

With these improvements to the browser’s isolation 
and communication primitives, frames are a more attrac- 
tive feature for integrating third-party web content. Two 
challenges remain for mashups incorporating untrusted 
content. First, a gadget is permitted to navigate the top- 
level frame and can redirect the user from the mashup to 
a site of the attacker’s choice. This navigation is made 
evident by the browser’s location bar, but many users 
ignore the location bar. Improving the usability of the 
browser’s security user interface is an important area of 
future work. Second, a gadget can subvert the browser’s 
security mechanisms if the attacker employs a browser 
exploit to execute arbitrary code. A browser design that 
provides further isolation against this threat is another 
important area of future work. 


Abstract

Cross-site scripting (XSS) and SQL injection errors 
are two prominent examples of taint-based vulnerabil- 
ities that have been responsible for a large number of 
security breaches in recent years. This paper presents 
QED, a goal-directed model-checking system that automatically
generates attacks exploiting taint-based vulnerabilities
in large Java web applications. This is the first
time that model checking has been used successfully
on real-life Java programs to create attack sequences that
consist of multiple HTTP requests.

QED accepts any Java web application that is writ- 
ten to the standard servlet specification. The analyst 
specifies the vulnerability of interest in a specification 
that looks like a Java code fragment, along with a range 
of values for form parameters. QED then generates a 
goal-directed analysis from the specification to perform 
session-aware tests, optimizes to eliminate inputs that 
are not of interest, and feeds the remainder to a model 
checker. The checker will systematically explore the re- 
maining state space and report example attacks if the vul- 
nerability specification is matched. 

QED provides better results than traditional analyses 
because it does not generate any false positive warnings. 
It proves the existence of errors by providing an exam- 
ple attack and a program trace showing how the code is 
compromised. Past experience suggests this is important 
because it makes it easy for the application maintainer to 
recognize the errors and to make the necessary fixes. In 
addition, for a class of applications, QED can guarantee 
that it has found all the potential bugs in the program. 
We have run QED over 3 Java web applications totaling 
130,000 lines of code. We found 10 SQL injections and 
13 cross-site scripting errors. 


1 Introduction 


As more and more business applications migrate to the 
Web, the nature of the most dangerous threats facing 
users has changed. Web applications are typically written
in languages that make classic exploits like buffer
overruns impossible, but new infrastructures bring new
vulnerabilities. Two of the most popular attacks in
this domain are SQL injection and cross-site scripting
(XSS) [12]. This paper presents a practical, programmable
technique that can automatically generate attacks
for large web-based applications. The system also
shows the statements executed over the course of the at- 
tack. This information can be used by application devel- 
opers to close these security holes. 


Many commercial systems, including Cenzic’s Hail- 
storm [7] and Core Security’s Core Impact [9], rely on 
black-box testing. In black-box testing of web applica- 
tions, the tester only has the level of access available to 
any external attacker—that is, it may only make HTTP 
requests and examine the responses. This approach has 
the advantage that any such analysis is independent of 
the target application’s implementation language, mak- 
ing it ideal for broad deployment. However, it cannot 
take advantage of the logic of the program; it may not be 
efficient, and it cannot provide any guarantee on cover- 
age. 

This paper presents a system called QED that automatically
finds attack vectors for a large class of vulnerabilities
in web applications written in the same application
framework. This system is based on the approach
of concrete model checking, a verification
technique based on systematic exploration of a program’s
state space. It is an attractive approach to security
problems because it can conclusively find vulnerabilities
and, if a systematic exploration proves exhaustive,
it can also prove that no vulnerabilities exist. However,
this technique is generally not feasible for large, real-life 
programs. In addition, a web application continuously 
accepts inputs, so it seems impossible on the surface to 
exhaust all possible paths. To make QED a practical tool 
that works on real programs, we built the system based 
on the design principles listed below. 
1. Many web application vulnerabilities, such as SQL
injection and cross-site scripting, can be generalized
as taint-based problems. By focusing on this class
rather than one vulnerability at a time, the QED system
is much more general. Users can specify taint-based
vulnerabilities in a language called PQL [22].
In fact, PQL extends beyond even taint-based analysis
as it includes execution patterns involving any
sequence of methods on a set of objects that is describable
via a context-free language.


Users can use QED for finding different vulnera- 
bilities, and even vulnerabilities that are specific to 
their own applications. It is very important that or- 
dinary developers be able to generate these analyses 
on their own. 


2. Today, application frameworks are heavily used in
web application development as they greatly reduce
software engineering time. We advocate extending
the notion of frameworks beyond software
development to include code auditing. Exploiting 
higher level semantic information about the frame- 
work makes it possible to generate more effective 
static analyses. Furthermore, by abstracting away 
the guts of a framework, we can concentrate our 
model checker’s effort on the application code it- 
self. This abstraction step needs only to be per- 
formed once for each framework, as the abstracted 
code is reusable. For this research, we have picked 
the following popular core frameworks for web ap- 
plications: 


• Java servlets [27], which are a standard extension
to the Java platform for writing web applications.


• JSPs (Java Server Pages) [28], which allow
page design to be commingled with database
accesses.


• Apache Struts [1], which is a web application
framework that uses the model-view-controller
paradigm. In this paradigm, a controller
decouples the data model from the user
view so they can easily be changed independently.


Any Java web application intended for deployment 
in a standard application server conforms to the 
servlet specification. If a Java web application also 
uses JSP or Struts, our framework will take advan- 
tage of the additional semantics as well. 


To demonstrate the effectiveness of this approach, 
we report the result of applying our tool across three 
different Java web applications developed on this 
framework. 
3. In model checking, we are simulating the program
execution on candidate input sequences. QED
uses JPF, the Java PathFinder model checking sys- 
tem [29], to do this. It is important that we con- 
centrate the model checking time on sequences that 
are likely to identify vulnerabilities. Based on the 
query, QED automatically compiles user-supplied 
queries into static analyses for the web application 
that prune out input sequences that are guaranteed 
not to expose any vulnerability. The static analysis 
generates a set of input vectors. If it is small, this set 
can be tested exhaustively; if it is not, the static anal- 
ysis’s results—directed by the user’s query—direct 
the checker to test more promising results first. 


1.1 Contributions 
This paper makes the following contributions. 


• A session-based model for user input in web applications.
Much work in testing web applications focuses
on either analyzing individual pages [31] or
simulating a browser user with a sophisticated spi- 
der [3]. We present a technique that bases its user 
model on data flow information across requests in a 
session. This helps restrict the search space while 
also exposing possible vulnerabilities that a spider 
or nonmalicious end user might never produce. 


• A programmable approach to checking event-driven
applications. QED is extremely flexible; its concept 
of vulnerability is merely “anything that matches a 
specification”, and the permissible specifications in- 
clude any context-free language of method calls on 
a consistent set of run-time objects. Though this 
paper focuses on taint vulnerabilities in web appli- 
cations, the technique generalizes to other error pat- 
terns as well as other event-based systems such as 
GUI applications or file systems. 


• A model-checking framework to systematically explore
standard Java web applications. We have
implemented a simulated environment for the Java 
PathFinder model checker that will systematically 
explore programs based on the Java Servlet Spec- 
ification. We have refined it further to work more 
effectively with the popular Apache Struts frame- 
work. 


• Experimental validation of our approach. We supplied
specifications for two major security vulnerabilities
(cross-site scripting and SQL injections) and
applied the QED system to three large Web applica- 
tions. These applications totaled roughly 130,000 
lines of non-library code. QED detected 10 SQL 
injection vulnerabilities and 13 XSS vulnerabilities. 




1.2 Paper Organization 


Section 2 describes the class of vulnerabilities of interest. 
Section 3 describes how we apply model checking to web 
applications to generate the attack vectors and get the ex- 
ecution trace. Section 4 describes how we use static anal- 
ysis to reduce the search space of model checking. Sec- 
tion 5 demonstrates the QED algorithm step by step on 
an example application. Section 6 details experimental 
results. Section 7 discusses related work, and Section 8 
concludes. 


2 Problem Statement 


Our algorithm accepts a web application and a vulner- 
ability specification, then generates a set of attack path 
components with corresponding execution traces. This 
section describes the class of applications and vulnera- 
bilities our system addresses. 


2.1 Taint Vulnerabilities 


SQL injection and cross-site scripting are both instances 
of taint vulnerabilities. All such vulnerabilities are de- 
tected in a similar manner: untrusted data from the user 
is tracked as it flows through the system, and if it flows 
unsafely into a security-critical operation, a vulnerabil- 
ity is flagged. In SQL injection, the user can add addi- 
tional conditions or commands to a database query, thus 
allowing the user to bypass authentication or alter data. 
With XSS, an attacker can inject his own HTML (includ- 
ing JavaScript or other executable code) into a web page; 
this is exploitable in many ways, up to complete com- 
promise of the browser. In the so-called “reflection at- 
tack” [12] XSS is used by a phisher to inject credential- 
stealing code into official sites without having to redirect 
the user to a copy of the site. This means that any secu- 
rity credentials will be valid on the attack site, and even 
whitelisting will not prevent the attack. 
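
As a concrete illustration of how untrusted input can add a condition to a query, consider the sketch below. The buildLoginQuery function, the table layout, and the input values are hypothetical and are not taken from any application studied here; the point is only that string concatenation lets the input change the structure of the query.

```javascript
// Hypothetical query builder that concatenates untrusted input directly
// into the SQL text (the unsafe pattern a taint analysis looks for).
function buildLoginQuery(user, password) {
  return "SELECT id FROM users WHERE name = '" + user +
         "' AND password = '" + password + "'";
}

// Benign input yields the intended query shape.
const benignQuery = buildLoginQuery("alice", "s3cret");

// Malicious input closes the string literal and appends a condition
// that is always true, so the password comparison no longer matters.
const injectedQuery = buildLoginQuery("alice", "' OR '1'='1");

console.log(benignQuery);
console.log(injectedQuery);
```

Because AND binds more tightly than OR in SQL, the injected query's WHERE clause is satisfied for every row, which is how an attacker bypasses the authentication check.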

Given the gravity of the vulnerabilities, we would like 
to eliminate their existence before deploying our applica- 
tions. Some of these vulnerabilities can be subtle, how- 
ever. It is not sufficient to just consider URLs in isolation 
because an attack may consist of a sequence of URLs. 
Consider a scenario with the example web application in 
Figures 1 and 2. An attack on this application can go
as follows: the attacker sends the victim an email containing
the URL http://example.com/search_begin.jsp?s=<script...
where the s parameter
carries a JavaScript payload crafted to log users’ keyboard
entries. The victim clicks on the link. Since this
is the user’s first interaction with example.com, a new
session is created by the web server, and when the JSP
checks the value of login, it finds nothing. It thus stores
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<html> 
<head> 
<% HttpSession s = getSession(); 
if (s.getAttribute("login") = 
s.setAttribute("text", 
getParameter("S"); %> 
<meta http-equiv="refresh" 
content="10; URL=search_login. jsp"> 
</head> 
<body></body> 
</html> 
<% } else { %> 
<!—- rest of page... -—-> 


= null) { 


Figure 1: Snippet from search_begin.jsp. 


<html> 

<body> 

<hl>Login required</h1> 

<p>To search for 
<s=getSession().getAttribute ("text") >, 
you must first log in.</p> 

<form> 

<!—- rest of page... -—-> 


Figure 2: Snippet from search_login.jsp. 


the search string in the session and generates a redirect 
page to search_login.jsp. That page then generates 
an error and requests login information. However, at this 
point it echoes the value from the session blindly, thus 
injecting the script and allowing the attacker to log the 
user’s password. This example illustrates that we need to 
analyze more than just individual requests to be sure we 
have found all vulnerabilities in a web application. 

We model the behavior of a web application as a series 
of request-response events; each URL corresponds to an 
HTTP request, and this request is processed to produce 
a response. We may characterize an attack vector by a 
sequence of URL requests in a session where untrusted 
input data propagates into security-critical operations. 


2.2 Domain of Web Applications 


We model a web application as a reactive system that 
operates on a session at a time. A session consists of a 
series of events, with each event being an HTTP request 
submitted by the same user. Note that while the request 
originates from the same user, its contents may actually 
be manipulated by an attacker. We do not place any re- 
striction on the ordering of events. In particular, it is not 
necessary that requests be constrained by the links avail- 
able on the last page viewed. This is necessary because 
an attacker can construct and send malicious requests
directly. This also argues against using web-spider
techniques to collect potential attack vectors.

In response to an event, a web application may modify 
the session data. This is information that is user-specific 
but maintained temporarily on the webserver over the 
course of a user’s interaction with the machine. In a web- 
server, a separate data structure is normally maintained 
for each user, and cookies or special arguments are
set to match each user to their session.

Sessions are assumed to be independent of each other. 
An attack may consist of a sequence of events within a 
session, but cannot span multiple sessions. Our reason- 
ing here is that any attack usable against another user 
should also be usable against oneself, and so the attack 
will still manifest. 


2.3 Vulnerability Specifications 


The set of taint-based vulnerabilities addressed by our 
technique consists of all attacks that match the following 
pattern: 


1. Untrusted data is read in from some taint source,
such as a user-controlled file, URL request, cookie 
value, or network source. It may subsequently be 
stored in arbitrary objects and passed in and out as 
parameters or returned results. 


2. Some methods may derive new objects from old. 
Some of these, if passed an untrusted object, will 
produce an untrusted object. Examples include 
methods that parse a request and create subobjects 
from the untrusted data, or methods that create 
larger strings by appending characters to the un- 
trusted data. We call these methods propagators. 


3. No untrusted data, whether from the original taint 
source or derived via propagators, may be used in 
any taint sink, such as a database access routine. 


4. The previous rule does not apply if the object has
been passed through one of several sanitizers that
quote or escape the contents of the object.


This is an abstraction of the general problem of in- 
formation flow control. Information is tracked from the 
source, through propagators, until it either hits a sanitizer 
and becomes safe, or hits a taint sink and possibly does 
damage. Once the tracker can confirm that all dangerous 
data only reaches sanitizers, a proof of the correctness 
of these sanitizers will suffice to prove the correctness of 
the entire program. 
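
The four rules above can be illustrated with a small sketch. The Java class below is a hypothetical illustration only; the name TaintTracker and its methods are assumptions made for this example and are not part of QED or PQL. It tracks objects by identity, so a value is untrusted only if it came from a source or was derived from an untrusted object by a propagator.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

/** Hypothetical taint tracker; illustrative names, not QED's or PQL's API. */
class TaintTracker {
    // Objects are tracked by identity, so a freshly created object starts untainted.
    private final Set<Object> tainted =
            Collections.newSetFromMap(new IdentityHashMap<Object, Boolean>());

    /** Rule 1: a value returned by a taint source is untrusted. */
    <T> T source(T value) {
        if (value != null) {
            tainted.add(value);
        }
        return value;
    }

    /** Rule 2: a propagator's result is untrusted if its input is. */
    <T> T propagate(Object input, T derived) {
        if (tainted.contains(input)) {
            tainted.add(derived);
        }
        return derived;
    }

    /** Rule 4: a sanitizer returns a new object, which is never marked. */
    String sanitize(String input) {
        // new String(...) forces a fresh identity even when nothing was replaced.
        return new String(input.replace("<", "&lt;").replace(">", "&gt;"));
    }

    /** Rule 3: reports whether passing this value to a sink is a violation. */
    boolean violatesAtSink(Object value) {
        return tainted.contains(value);
    }
}
```

Under these assumptions, a string built by concatenating a source value is flagged at the sink, while the escaped copy returned by sanitize is not.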

Our vulnerability specification consists of four pat- 
terns, one for each of the previously enumerated compo- 
nents. These patterns are expressed as PQL queries. PQL 
is a powerful specification language that permits one to
specify patterns of events on objects in a manner similar
to program snippets. It permits subqueries to be defined
and then matched against as well. We can exploit
this by defining the components of our specification as
subqueries and then linking them together with a generic
main query that works for any taint problem.


query source(object * x)
matches
    HttpServletRequest.getParameter(x)
  | x = Cookie.getValue();

query prop(object * x, object * y)
matches
    (StringBuilder) y.append(x)
  | y = (StringBuilder) x.toString();

query sink(object * x)
matches
    JspWriter.print*(x)
  | JspWriter.write(x, ...);


Figure 3: XSS vulnerability specification.

A simple example for XSS in JSPs is shown in Figure 3.
All three of its defined subqueries are a logical
OR between individual method calls. Its taint sources,
HttpServletRequest.getParameter and Cookie.getValue,
are defined for all Java web applications [27]. Likewise,
the JspWriter class in the taint sink is defined in the
JSP specification [28]. PQL permits method names to
be regular expressions, and so we collect all print and
println method calls within a single clause.

The propagation rules in the prop query handle string 
concatenation in Java 1.5. In the full specification, other 
versions of Java and other modes of string propagation 
are also handled. These are simply added as additional 
OR clauses; we omit them here for clarity. 

Care must be taken when developing the 
specification—missing a propagator may lead to 
false negatives in the final result, while missing san- 
itizers is likely to lead to many false positives. A 
suitably crafted general specification, however, can 
apply to many applications directly or with only minor 
modifications to specify details and application-specific 
sanitizers. Furthermore, the operation of the model 
checker will suggest which modifications need to be 
made to refine the query. 

Due to the design of the Java libraries, web application 
queries will rarely need to explicitly specify sanitizers. 
Java’s String class is immutable, and it is also the class 
that represents the beginning and end points of any web 
transaction. Since the sanitization process will generally 
create an entirely new String, this freshly created object 
would thus be considered safe. This is another reason we 
must be particularly careful not to miss any propagators: 
any propagator we fail to specify will be treated as a
sanitizer.


Figure 4: QED architecture. User-supplied information
is on the left. (Diagram labels: Application, PQL,
Input Parameters, Instrumenter, Generator, Goal-Directed
Optimizer, Model Checker, QED Libs, Attack Paths,
Analyst.)

It is also possible that a sanitizer might perform its 
transforms using propagator methods. This would re- 
quire explicitly marking the result as sanitized. How- 
ever, this situation never occurred in our experiments. 
We never found it necessary to explicitly specify sani- 
tizers, and our XSS query worked unmodified with all 
applications. 


2.4 PQL Instrumentation and Matching 


The vulnerability specification is translated by the PQL 
compiler into a set of instrumentation directives. When 
applied to the target application, they weave in monitor- 
ing code to detect matches to the query, and to report on 
the objects involved [22]. When a match is found by the 
monitor, it signals the model checker to report that a fail- 
ure condition has been found. If no match occurs, the 
model checker’s backtracking mechanism will also roll 
back the matching machinery to the appropriate state.


3 Input Generation


In this section, we describe how QED enumerates attack 
vectors for a target web application. An analyst must 
provide two components: the PQL query specifying the 
vulnerability, and a set of input values for any form pa- 
rameters. Given these, QED will do the rest. A diagram 
of the process is shown in Figure 4. 

The input application is first instrumented according 
to the provided PQL query, as described in Section 2.4. 
The instrumented application is then combined with a 
custom, automatically generated harness. This is a pro- 
gram that will systematically explore the space of URL 
requests. Each URL consists of a page request (the path, 
covered in Section 3.1), and an optional set of input pa- 
rameters (the query, discussed in Section 3.2). The har- 
nessed application is then fed to the model checker, along 
with stub implementations of the application server’s en- 
vironment. The results of that model checker correspond 
directly to sequences of URLs that demonstrate the at- 
tack paths. 

We may also optionally improve our search by opti- 
mizing the harness before the model checking step; we 
discuss these refinements in Section 4. 


3.1 Generating Page Requests 


An attack path is a sequence of URLs, each of which 
consists of a page request (the path) and a set of input pa- 
rameters (the query) [4]. The web application translates 
a URL into a method invocation with a set of parameters. 

Thus, a URL corresponding to our sample JSP earlier:

    http://www.example.com/search.jsp?s=foo

would translate into the method invocation

    org.apache.jsp.search_jsp.doGet(req, resp)

where the call req.getParameter(x) yields the value
“foo” if x is “s”, and yields null otherwise. The resp
parameter represents the response to be returned.

There is a simple correspondence between a URL and
a method invocation. We refer to the method invocation
as an event. An event consists of:


1. a reference to an event handler. The event handler 
corresponds to the path of the URL. It identifies 
the name of the Java method to be invoked when 
a matching URL is received. 


2. event handler parameters. These typically corre- 
spond to the query part of the URL. They provide 
extra parameters used by the handler, and generally 
carry the more free-form data. They may include
cookies or information supplied by a user when filling
out forms. They may also contain the payload
that an attacker wishes to inject into the system.
Thus, it is very important to model these inputs
carefully. 
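
The two components of an event can be sketched as a pair of small Java classes. This is an illustration under stated assumptions, not QED's harness code: the names SketchRequest and SketchEvent are hypothetical, the handler is identified simply by the last path segment (the JSP compiler's real class-name mangling is more involved), and no URL decoding is performed.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the request object handed to an event handler. */
class SketchRequest {
    private final Map<String, String> params;

    SketchRequest(Map<String, String> params) {
        this.params = params;
    }

    /** Mirrors the servlet convention: null when the key was not supplied. */
    String getParameter(String key) {
        return params.get(key);
    }
}

/** An event: a handler (from the URL path) plus its parameters (from the query). */
class SketchEvent {
    final String handler;
    final SketchRequest request;

    SketchEvent(String handler, SketchRequest request) {
        this.handler = handler;
        this.request = request;
    }

    /** Splits "path?k=v&k2=v2" into a handler name and a parameter map. */
    static SketchEvent fromUrl(String url) {
        int q = url.indexOf('?');
        String path = q < 0 ? url : url.substring(0, q);
        Map<String, String> params = new HashMap<>();
        if (q >= 0) {
            for (String pair : url.substring(q + 1).split("&")) {
                if (pair.isEmpty()) {
                    continue;
                }
                int eq = pair.indexOf('=');
                if (eq < 0) {
                    params.put(pair, "");
                } else {
                    params.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            }
        }
        // Assumption: the last path segment is enough to name the handler.
        String page = path.substring(path.lastIndexOf('/') + 1);
        return new SketchEvent(page, new SketchRequest(params));
    }
}
```

For the sample URL above, fromUrl yields the handler search.jsp, and getParameter returns “foo” for the key “s” and null for any other key.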


Most Java web applications are developed using a 
framework that makes explicit the set of URLs it accepts 
and its corresponding event handlers. Our system cur- 
rently handles three popular frameworks, as discussed 
below. 


• Servlets are the most basic form of server-side Java,
and are the lowest level of abstraction available.
Any Java web application intended to be run in a
standard application server must ultimately use this
specification. Individual servlets are Java classes
that implement a well-specified API [27]. The
URL-to-servlet mapping is specified by an XML
file as part of the application’s metadata. QED simply
interprets the XML file to determine the list of
event handlers in the application.


• JavaServer Pages, or JSPs, provide a PHP-like
interface to Java [28]. They are compiled by a JSP
compiler such as Jasper into servlets. The URL-to-servlet
mapping in this case is specified by a
transformation of the JSP’s path in the file system, which
generates the class name.


• Apache Struts is a popular application platform
built on top of JSPs and the core servlet specification
[1, 13]. It implements its own Action API similar
to the servlets API, but which forwards to JSP
files for actual HTML output. A URL in a Struts
application thus maps to two calls in sequence: a call
to an Action’s entry point, and a call to the associated
JSP’s entry point. These mappings from URLs
to Actions and JSPs are specified in an XML file in
a manner similar to the specifications for servlets.


For each of these, QED can produce a comprehensive
list of paths understood by the application. By default,
it tests sequences of these paths in breadth-first order:
first all sequences of length 1, then all of length 2,
and so forth. This has no obvious
termination condition, however; our optimizations and
heuristics in Section 4 provide limits.


3.2 Parameters to Event Handlers 


In Java web applications, data from the user is repre- 
sented by a set of key-value pairs mapping strings to 
strings. Applications conforming to the Java Servlet 
Specification use a method called getParameter to
retrieve a value for a given key. QED rewrites methods
corresponding to taint sources to call out to the model
checker, indicating to the model checker that there is
non-determinism associated with the returned value of 
the method. The model checker will cycle through the 
possible values, including the option that no such key 
was provided by the user. 

We rely on the analyst to provide a sufficient pool of 
values to test the application. It would be infeasible to 
test every possible string that could be supplied to the 
event handlers, but it is also not necessary. Our goal is 
merely to show that it is possible for data from a taint 
source to reach a taint sink. If a controlled string is dis- 
played, this is a vulnerability. 

In cases where the contents of an input string do mat- 
ter, the data are often expected to be in a certain form: 
if they do not conform to the expected type, some paths 
may not be executable. For our experiments, we supplied 
one of the common default types used by web applications
in general: integers, booleans (“yes”, “true”, etc.)
and generic strings. We also included the null object to 
represent the lack of an argument. 
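
The effect of the input pool on the number of test cases can be sketched as follows. This is a minimal illustration, not QED's implementation: the class InputPool and its methods are hypothetical, and null in the pool stands for "this key was not supplied". Each parameter key is independently assigned one pool value, so a handler with k parameters and a pool of p values yields p^k cases.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical enumeration of parameter assignments drawn from an input pool. */
class InputPool {
    /** Every assignment of one pool value to each key; a null value omits the key. */
    static List<Map<String, String>> assignments(List<String> keys, List<String> pool) {
        List<Map<String, String>> out = new ArrayList<>();
        out.add(new LinkedHashMap<String, String>());
        for (String key : keys) {
            List<Map<String, String>> next = new ArrayList<>();
            for (Map<String, String> partial : out) {
                for (String value : pool) {
                    Map<String, String> m = new LinkedHashMap<>(partial);
                    if (value != null) {
                        m.put(key, value);
                    }
                    next.add(m);
                }
            }
            out = next;
        }
        return out;
    }

    /** Number of cases for one handler: poolSize raised to parameterCount. */
    static long caseCount(int poolSize, int parameterCount) {
        long n = 1;
        for (int i = 0; i < parameterCount; i++) {
            n = Math.multiplyExact(n, (long) poolSize);
        }
        return n;
    }
}
```

With these definitions, caseCount(5, 15) is 30,517,578,125 and caseCount(2, 15) is 32,768, which is the arithmetic behind the “over 30 billion” and “32,000” figures quoted for JGossip in Section 6.3.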

Applications may also require application-specific 
“magic” values that influence control flow. The most 
common case for this is an action variable or similar, 
which holds one of several values depending on the value 
of a list box or similar. In such cases, QED can usually 
extract the information we need via a constant propaga- 
tion analysis; this will tell us if an argument from the 
query string is compared against constant strings. By 
enumerating these strings and ensuring they are possi- 
ble values for our keys, we search the input space more 
exhaustively. 

It would be possible to combine this work with an an- 
alysis similar to EXE [6] to determine a set of inputs that 
would exercise all predicates in the web application. For 
our experiments so far, however, we have found that even 
our simple constant-propagation analysis is overkill. Al- 
most all data read from the user is processed and dumped 
directly into a data sink. In these circumstances the con- 
trol flow cannot change based on input. 


4 Goal-Directed Optimization 


In this section, we present several optimizations to
reduce the search space of model checking. The key
insight is that we should not treat all URL sequences as
equally likely to yield a new vulnerability, since we may
have already checked a shorter, equivalent sequence.
Since we check in increasing order of length, any match
the longer sequence finds will have already been
discovered. There are four principles we apply to focus
the search:


• The final request in the sequence must finish the
demonstration of a vulnerability (Section 4.1).


• Every request must, directly or indirectly, influence
the final result (Section 4.2).


• No sequence ever repeats a request (Section 4.3).


• A match can only occur in a sequence if there are
objects that would satisfy that match participating
in that sequence (Section 4.4).


4.1 Filtering Final Events 


QED’s model checker searches through candidate sequences
in length order. This means that for any given
vulnerability in the code, the shortest demonstration of it
will appear first. We can therefore require that the final
request of every candidate sequence be the one that
completes the match. If it does not, any possible
vulnerability would have already been shown before the
final request was processed, so a prefix of the sequence
would suffice, and will in fact have already been checked.
This condition is thus stronger than a simple breadth-first
search, which can only confidently eliminate sequences
with a prefix corresponding to a known vulnerability.

To perform the final event filter, we need two pieces of 
information. First, we need to know which method calls 
in the application can in fact complete a match. For a 
taint problem, this is straightforward, as it is any method 
listed as a taint sink. For PQL in general it may be nec- 
essary to perform a simple control-flow analysis on the 
query to determine the set of events that can occur last. 

We then need to determine which URL requests can 
lead to match completion. We do this by writing a sim- 
ple harness program that calls each entry point in the ap- 
plication in turn. We then compute a call graph of this 
harness and determine which entry points can eventually 
call a match-completing method. 

Any sequence which does not end in a call to one of 
these entry points is guaranteed to not affect the final re- 
sult, and thus may be discarded. 
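
The second step amounts to a reachability query over the call graph. The sketch below assumes the call graph is already available as a map from method name to callee names; the class FinalEventFilter and the method names used with it are hypothetical, and QED's actual analysis is not shown in this paper at this level of detail.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical final-event filter: which entry points can reach a sink? */
class FinalEventFilter {
    static Set<String> canFinish(Map<String, List<String>> callGraph,
                                 Collection<String> entryPoints,
                                 Set<String> sinks) {
        Set<String> result = new TreeSet<>();
        for (String entry : entryPoints) {
            Deque<String> work = new ArrayDeque<>();
            Set<String> seen = new HashSet<>();
            work.push(entry);
            seen.add(entry);
            // Depth-first walk over callees until a sink is found or nothing is left.
            while (!work.isEmpty()) {
                String method = work.pop();
                if (sinks.contains(method)) {
                    result.add(entry);
                    break;
                }
                List<String> callees =
                        callGraph.getOrDefault(method, Collections.<String>emptyList());
                for (String callee : callees) {
                    if (seen.add(callee)) {
                        work.push(callee);
                    }
                }
            }
        }
        return result;
    }
}
```

Only sequences whose last request is one of the returned entry points need to be kept.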


4.2 Eliminating Redundant URL Sequences


HTTP is a stateless protocol. Web applications main- 
tain state across requests either client-side with cook- 
ies or server-side with session data. We treat cookies 
as a source of user input, as cookie information may 
be forged, deleted mid-session, or otherwise tampered 
with. Session information remains under the control of 
the server and can thus be tracked more precisely. 

The motivation behind this optimization is that this 
mechanism is the sole form of data-flow through the ses- 
sion. If there is no data-flow contributed by a part of a 
candidate sequence, we need not include that part.
Furthermore, since we are checking in increasing order of
length, removing this redundant part of the sequence
produces a sequence that we have already either checked or
proven irrelevant. 

To perform this optimization we need a way to char- 
acterize the cross-request data-flow. We do this via a 
dependence relation: an event handler m_1 depends on
another event handler m_2 if m_1 can potentially read the
data written by m_2. To compute the dependence relation,
we must determine the flow of data within a session. 

The Java Servlet Specification provides an explicit 
API to capture this. Data are passed between handlers via 
a special object of type javax.servlet.http.HttpSession.
This session object functions as a string-to-object map. 
For each request, we determine what string values can 
be used as keys to the map for reads and writes. This 
information is available via a call graph analysis as in 
Section 4.1, supplemented with pointer and constant- 
propagation information to determine which string val- 
ues may be used as keys. If a nonconstant string is used 
as a key, we assume that handler may access anything in 
the session. 

With this information we can compute the dependence 
relation by treating each key as a storage location and 
determining def-use information. We then take the tran- 
sitive closure of the dependence relation, and eliminate 
any sequence in which there are requests that do not in- 
fluence the final request. 
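
A minimal sketch of this check, assuming the read and write key sets per handler have already been computed. The class SessionDependence is hypothetical. It interprets "influences the final request" as a chain of dependences through later requests in the same sequence, which is one reading of the transitive closure described above, and it omits the conservative treatment of non-constant keys.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical def-use model of session keys shared between handlers. */
class SessionDependence {
    private final Map<String, Set<String>> reads;   // handler -> keys it may read
    private final Map<String, Set<String>> writes;  // handler -> keys it may write

    SessionDependence(Map<String, Set<String>> reads, Map<String, Set<String>> writes) {
        this.reads = reads;
        this.writes = writes;
    }

    /** True if reader may read a session key that writer may write. */
    boolean dependsOn(String reader, String writer) {
        Set<String> r = reads.getOrDefault(reader, Collections.<String>emptySet());
        Set<String> w = writes.getOrDefault(writer, Collections.<String>emptySet());
        return !Collections.disjoint(r, w);
    }

    /** True if every request feeds, directly or via later requests, into the last one. */
    boolean isNonRedundant(List<String> sequence) {
        int n = sequence.size();
        if (n == 0) {
            return false;
        }
        boolean[] influences = new boolean[n];
        influences[n - 1] = true; // the final request trivially influences itself
        for (int i = n - 2; i >= 0; i--) {
            for (int j = i + 1; j < n && !influences[i]; j++) {
                influences[i] = influences[j]
                        && dependsOn(sequence.get(j), sequence.get(i));
            }
            if (!influences[i]) {
                return false; // request i contributes no data flow
            }
        }
        return true;
    }
}
```

Any sequence for which isNonRedundant returns false can be dropped, since a shorter sequence without the non-contributing request covers the same data flow.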


4.3 Removing Repetitive Cycles 


If the dependency relation is cyclic, there will be a count- 
ably infinite number of possible candidates to test. To 
keep the test sequence finite, we restrict our sequences to 
only call any given entry point once. 

This heuristic would need to be refined for web appli- 
cations where one physical page serves as multiple logi- 
cal pages (controlled, say, by some action parameter); 
however, this situation did not arise in any of our experi- 
ments. 
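
Combining the length-ordered search of Section 3.1 with the restriction to one call per entry point gives a finite enumeration. The sketch below assumes handler names are plain strings; the class SequenceEnumerator is hypothetical and is not QED's harness generator.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical enumerator of non-repeating handler sequences. */
class SequenceEnumerator {
    /** All non-repeating sequences, shortest first, up to maxLength. */
    static List<List<String>> enumerate(List<String> handlers, int maxLength) {
        List<List<String>> result = new ArrayList<>();
        List<List<String>> frontier = new ArrayList<>();
        frontier.add(new ArrayList<String>());
        for (int len = 1; len <= maxLength; len++) {
            List<List<String>> next = new ArrayList<>();
            for (List<String> prefix : frontier) {
                for (String handler : handlers) {
                    if (prefix.contains(handler)) {
                        continue; // each entry point is called at most once
                    }
                    List<String> seq = new ArrayList<>(prefix);
                    seq.add(handler);
                    next.add(seq);
                }
            }
            result.addAll(next);
            frontier = next;
        }
        return result;
    }
}
```

For three handlers this produces 3, 6, and 6 sequences of lengths 1, 2, and 3 respectively, and no longer sequences exist, so the search terminates.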


4.4 Statically Eliminating Sequences 


We further reduce the search space by using a static an- 
alysis to prune off sequences that cannot possibly match 
our query. This is especially important for sequences 
that use a large number of widely variable parameters, 
as eliminating a single sequence can translate into thou- 
sands or even millions of candidates that need not be 
checked. The algorithm is described below. 


1. QED constructs a new harness for the application
that iterates through all sequences that pass the
preceding three criteria. The harness defines a method
for each input sequence, and the method calls the
entry point for each of the URL requests in the
sequence.


2. QED translates the PQL query specifying the defect
of interest into a sound context-sensitive
interprocedural analysis that determines if the query can be
satisfied. QED applies the analysis to the harness
to find the methods (input sequences) that can
potentially generate a match. The algorithm used has
been described in a previous paper [22]. This
analysis tracks pointers in a context-sensitive but
flow-insensitive manner. The analysis is sound:
no approximation done by the pointer analysis will
produce false negatives. All sequences found by the
analysis to be incapable of generating a match may
be ignored without compromising the soundness of
the model checker.


The success of this step hinges on both the precision 
and the conservativeness of the pointer analysis used. An 
overly imprecise analysis will not be able to eliminate 
any candidates, while a non-conservative analysis will 
prune away candidates that might be valid. The QED 
system applies the context-sensitive, conservative, inter- 
procedural, and inclusion-based analysis of Whaley and 
Lam [32], along with improvements by Livshits et al. 
to handle reflection [21]. The results of this analysis 
are stored in a deductive database which QED consults 
throughout the optimization process [19]. 


5 Example 


We will now show the operation of this algorithm by de- 
tecting an XSS vulnerability in a simple three-page web 
application. The pages in this application are as follows: 


• search.jsp, which presents a search form to the
user and sends the results on to searching.jsp.


• searching.jsp, which reads a search parameter s
and stores it in the session. The display is a simple
timed redirect to result.jsp.


• result.jsp, which prints the results of the search.
It also echoes the initial input, retrieved from the
session. This represents a cross-site scripting
vulnerability.


For our example, we use the stock XSS vulnerability 
query from Figure 3. The PQL instrumenter will trans- 
form the application, tracking all calls to sources, sinks, 
and propagators. 

For our model environment, we will only concern our- 
selves with whether or not an argument is present, so we 
will set null and “SampleString” as our input pool.
QED will generate a test harness for the application,
providing these values as plausible results for the sources,
and calling all possible sequences of events. Since we 
only consider non-repeating sequences, there are fifteen:
three of length 1, six of length 2, and six of length 3. The
entry points for these events will simply be the doGet 
methods on the classes corresponding to each JSP. 

In the optimization step, the final events filter has no 
effect for this query. The sink for the XSS query is 
JspWriter.print(), which all three pages call as part 
of their output generation. 

The dependency criterion is much more fruitful. 
Our session-based def-use analysis concludes that 
searching.jsp writes the session, while result.jsp 
reads it with the same key. This yields a dependency re- 
lation with one fact, and the dependency criterion elim- 
inates all but four sequences—each page alone, and the 
[searching.jsp, result.jsp] sequence. Factoring in 
the choice of s in searching.jsp, this yields a grand 
total of five test runs. 

The pointer analysis phase shows that searching.jsp 
is the only request handler with a source in it, thus elim- 
inating two of the length-one sequences immediately. It 
can then show that, as searching.jsp’s parameter is 
only fed into a session and the handler itself only emits 
constant strings, the lone searching.jsp request also 
cannot complete a match. Thus, for our example appli- 
cation, we are able to pre-prune every sequence of events 
but one. The only task remaining for the model checker 
is to demonstrate which values for s, if any, will actually 
produce a vulnerability. 

The model checker will return the following sequence 
as a demonstration of an XSS attack path: 


• searching.jsp?s=SampleString
• result.jsp


Despite the fact that a typical use case would derive 
its input from search.jsp, the page does not actually 
contribute anything to the vulnerability itself. 

In general, the amount of search space that can be re- 
moved by our optimizations will depend on several fac- 
tors. The number and prevalence of taint sinks is one; 
if there are more places where the path can end, there 
will clearly be more paths. However, the dominant fac- 
tor will be the fan-out from the session data-flow. With a 
low fan-out, even a large number of sinks will not multi- 
ply unduly. 


6 Experimental Results 


We applied QED to three Struts-based web applications 
from the open-source repository SourceForge. Basic
information about these is shown in Figure 5. They are
listed in order of their size. For each application, we list
the number of classes defined in the program, the size of
the application itself (not counting library classes), and
the total number of event handlers specified by the
application’s deployment metadata. The last column of
Figure 5 shows the number of dependency pairs found by
our dependence analysis described in Section 4.2.


Benchmark     Description        Lines of Code  Classes      Event Handlers  Dependency Pairs
PersonalBlog  Blogging software  17,149         [illegible]  15              0
JOrganizer    Address book       [illegible]    263          46              49
JGossip       Forum system       79,685         356          80              267


Figure 5: Applications used in the experiments. (The lines of code do not include library classes.)


Figure 6: Analysis results.


We used QED to locate both cross-site scripting and 
SQL injection vulnerabilities in each of these applica- 
tions. Each of these applications depends on a database 
backend. The JGossip application used JDBC directly; 
the other two used object persistence libraries that we 
modeled as stubs. All three applications, since they are 
Struts-based, rely on JSPs for their output, and so the 
XSS analysis dealt primarily with those. 


Figure 6 presents some measurements of our experiment.
The first column (Non-redundant URL Sequences)
lists the number of sessions whose URLs are not
repeated and not redundant according to their data
dependencies. PersonalBlog does not have cycles in its
dependence graph, so it is possible to exhaustively model
check the program by testing the specified number of input
sequences. The next column (Ends in SQL Sink) shows the
result of applying the full redundancy elimination
analysis algorithm presented in Section 4.2. The next
column (SQL Sessions) shows the number of sessions that
need to be checked after the feasibility analysis from
Section 4.4 is also taken into account. The next column
gives the number of SQL injections QED discovered.


The final two columns provide similar information for 
XSS. We do not provide an equivalent to the “Ends in 
SQL Sink” column because the XSS sink is HTML out- 
put, and so every HTTP response by definition includes 
a sink. Between SQL injection and cross-site scripting, 
we thus cover both rare and common sinks in our appli- 
cations. 


For comparison, even if we restrict ourselves to
non-repeating URL sequences, the naive approach of
Section 3 would test a number of sessions proportional to
the factorial of the number of event handlers. In JGossip, 
this is approximately 101° sequences. 


6.1 PersonalBlog 


The PersonalBlog system is a web application based on 
Struts and the Hibernate 2 object persistence system [2]. 
It makes no interesting use of session objects, so there 
are no dependencies between handlers. Thus, the depen- 
dence analysis shows that we can consider each event 
handler in isolation without compromising any guaran- 
tee on security. Since there are only 15 event handlers 
in the program, and each request has few parameters, the 
model checker can run through all the cases quickly. 
QED found one XSS attack vector and two SQL attack 
vectors. Note that a single vector can have multiple vul- 
nerabilities. In this case, one of the SQL vectors has two 
SQL injection possibilities. Thus, there are actually three 
SQL vulnerabilities that we have found. The static anal- 
ysis in this case was accurate in identifying all the vul- 
nerabilities, without generating any false positives. The 
model checker generates the input vectors and a program 
execution trace showing the details of their existence. 

The results of running PQL itself, as a dynamic
checker, on PersonalBlog have also been reported
previously [22]. Not only did QED find all the vulnerabilities
previously identified, it found an additional one. This 
discrepancy is due to QED having a more inclusive spec- 
ification than in the previous work, tracking information 
from HTTP headers and not just from the URL proper. 


6.2 JOrganizer 


JOrganizer is a personal contact and appointment manager
of moderate size. Access to the backing database
is managed within the application by an “Object Query
Language” that reduces directly to SQL, much like Hi- 
bernate 2. 

The application has 46 event handlers in total. The de- 
pendence analysis shows that there are 49 pairs of depen- 
dent event handlers. The dependence relations are cyclic, 
which means that we will have to restrict our attention to 
acyclic sequences to keep the test space finite. 

QED then further focuses the model checking effort 
by using information specific to each vulnerability. We 
found that 15 of the event handlers cannot touch the 
database at all, and thus cannot be final events for SQL 
injection. Furthermore, none of the single-event se- 
quences exhibits a SQL vulnerability. The reason is that 
no event is allowed to touch the database unless it is pre- 
ceded by a “log-in” event. Our analysis shows a de- 
pendence between these events and the “log-in” event. 
QED ignores the independent pairs, and keeps testing 
sequences with first a log-in event and then a database 
access event. 

QED is able to iterate through all the filtered, non- 
redundant, and non-repeating sequences in this case, 
finding three XSS vulnerabilities and eight SQL vulner- 
abilities. 


6.3 JGossip 


JGossip is a large application with nearly 80,000 lines 
of code in 80 actions. There are many cyclic dependen- 
cies among event handlers in JGossip. Even if we restrict 
the sequences under consideration to non-redundant and 
non-repeating events, over a million sequences still re- 
main. Furthermore, within these sequences, many re- 
quests used enormous numbers of input parameters. One 
event had 15 parameters, which, with a pool of 5 possible 
inputs per parameter, would generate over 30 billion test 
cases simply for that one URL. For event handlers such 
as those, we restricted our model checker’s input pool to
two possibilities per parameter, lowering the number of 
test cases per handler to a more manageable 32,000. 
Next QED tries to reduce the number of candidate vec- 
tors based on the vulnerability specification. For SQL 
injection, the taint sinks are database query methods. A
majority of the 80 actions touch the database. However, 
our static feasibility analysis shows that only seven of 
these database accesses may touch tainted objects. Thus 
we have only 7 final events to consider. Of the seven, five 
have no dependency chains longer than length two. They 
are responsible for a total of 37 potential attack vectors, 
and they are all of length 2. The remaining two have 
many dependencies, and the static analyzer can only nar- 
row them to 9,436 candidate attack vectors. Once param- 
eters are factored in, this still yields hundreds of millions 
of candidates to check, so there are still too many to consider.
In the end, we managed to check only the seven
sequences with a single taint-sink event and all 37
sequences of length 2. The model checker found no SQL
vulnerability.

As it happens, this lack of SQL injections is unsurpris- 
ing, because JGossip is constructed to be independent 
of its database backend. As such, all of its database requests
are ultimately constant strings; it uses a hash table
to look up which strings are appropriate for a given
action, based on the SQL dialect used by the backend.
Since all SQL queries end up being constant strings, this 
suggests that no injections are possible; however, its use 
of hash tables forced the program analyzer to make con- 
servative approximations on seven of the actions, thus 
leading to the need for a model checking step. 

The tests for XSS were much more straightforward; all 
of the actions corresponding to possibly dangerous out- 
put JSPs had few inputs and few dependencies, leading 
to a grand total of only 30 sessions to check. The XSS 
vulnerabilities so found were also located immediately, 
since session data did not affect their outputs. 


6.4 Experimental Summary 


The three web applications in our experimental study il- 
lustrate a spectrum of effects we can get with QED. Per- 
sonalBlog shows an example where QED is able to prove 
that there are no vulnerabilities other than the ones found. 
By proving that the events have no dependencies, QED 
can simply check the URLs one at a time. JOrganizer 
shows that in the presence of dependencies, our analyses 
can greatly improve the effectiveness of model checking 
and provide good coverage. QED was able to check all
the sequences without repeated URLs. Lastly, JGossip 
shows that model checking for really large programs re- 
mains a challenge. The static analyzer is useful as a way 
of directing the model checking to focus on sequences 
with higher payoffs. 


7 Related Work 


Systematic automated testing is not entirely novel, but 
it is also not commonplace. Our work was informed by 
both the FiSC system [34] and WebSSARI [17]. Web- 
SSARI’s approach is much different from QED’s, in that 
it focuses on abstract interpretation of PHP code looking 
for violations of data flow control. QED, on the other 
hand, owes more of its design philosophy to FiSC. FiSC 
operated in an entirely different problem domain (filesys- 
tem correctness) and simply searched for evidence of er- 
rors rather than the cause. Its implementation was based 
on the CMC model checker [23] which is also much 
closer to our JPF-based system than WebSSARI’s run- 
time solution. 

The techniques described in this paper touch upon a
wide variety of disciplines. Model checking is the most 
directly obvious of these. Our system uses the Java 
PathFinder system [29]. JPF was suitable for our system 
primarily due to the ability to directly run sizable Java ap- 
plications as bytecode; this permitted us to treat our dy- 
namic analysis as just another part of the application be- 
ing checked. Classical model checkers such as SPIN [14] 
require a special specification language which abstracts 
the application greatly. Other model checking systems 
such as Bandera [8] also directly abstract the Java source, 
which complicates their utility for our purposes.

The more general field of bugfinding comprises an 
enormous amount of work. In recent years, web ap- 
plications have received a good deal of attention due 
to their unique vulnerabilities and flaws. SABER is 
a static tool that detects flaws based on pattern tem- 
plates [25]. Livshits and Lam made progress in creat- 
ing a sound analysis on web applications that produced 
a usably low false positive rate [20]. The WebSSARI 
system, in its pre-model-checking work, allows the spec- 
ification of taint-style data-flow problems on PHP-based 
applications, and systematically searches for dangerous 
information flows [16]. Nguyen-Tuong et al. use similar 
approaches, also for PHP [24]. In a more general con- 
text FindBugs attempts to locate a broad class of bugs 
in Java applications of all kinds [15], and the Metal system
lets the user specify state machines to represent error
conditions [11]. The SQLCHECK system uses a much 
more precise technique to detect grammatical changes in 
commands as a result of user input [26]. The QED sys- 
tem provides a general analysis that the user specializes, 
while SQLCHECK is SQL-injection specific and Find- 
Bugs is a battery of unrelated analyses. Taint flow within 
an application is tracked incidentally, and only if the PQL 
specification demands it. 

Our characterization of inputs, when combined with 
model checking, can be seen as a form of testing, and all 
testing techniques perform better with a better set of in- 
puts. Some work has been done on systematically deduc- 
ing inputs that will explore the state space of an applica- 
tion. Systems such as Korat [5] attempt to systematically 
produce only consistent inputs; this is rarely relevant 
to web applications, whose arguments can be nearly- 
arbitrary strings. Korat’s general principle of deducing 
input sets from execution constraints, however, may still 
be applicable. Symbolic execution techniques, such as 
DART [10] and EXE [33], suitably adapted to deal with 
string and URL data, are more likely to be a fruitful ad- 
junct to the techniques in this paper. Some work has been 
done already to provide these techniques for JPF but the 
results given seem to indicate that at present it scales only 
to smaller applications [30]. 

For the specific problem of cross-site scripting, recent
work has focused on extending the DOM to permit
browser extensions to block out any unauthorized
scripts [18]. While, if fully implemented, this system 
will block out any possible attacks, it requires cooper- 
ation between both site authors and clients. Client-side 
protection is also of limited use against taint problems 
such as SQL injection that attack the server. 


8 Summary and Conclusions 


Security concerns regarding web applications are here to 
stay, and likely only to grow in importance. Cross-site 
scripting and SQL injection are two of the most popular 
kinds of vulnerabilities. This paper presented a technique 
called goal-directed model checking that can find attack 
vectors for these vulnerabilities automatically and effi- 
ciently. Armed with actual attack vectors and their cor- 
responding execution trace, it is easier to convince the 
developers that it is necessary to change the code, and 
also to pinpoint how the problem can be fixed. 

Our technique is implemented in a system called QED. 
Users can use the system for any taint-based vulnerabil- 
ity on Java applications developed using servlets, JSPs, 
or Struts. We applied QED to three programs and found 
errors in every one of them, yielding a total of 10 SQL 
injection and 13 XSS vulnerabilities. This result is wor- 
risome, suggesting that there are plenty of security risks 
in using web applications. 

This work also shows for the first time how we can 
combine techniques from three approaches to generate a 
useful and powerful system: 


Sound, sophisticated program analysis. Sophisticated 
analysis based on context-sensitive pointer alias an- 
alysis is precise enough to use on production soft- 
ware, despite being conservative to retain sound- 
ness. Nonetheless, false positives are still bound to 
occur with a conservative analysis. 


Dynamic monitoring. Dynamic analysis does not have 
false positives, but it can only spot problems that its 
input happens to trigger. 


Model checking. Model checking has many advantages: 
it executes all the paths in a program; it has no false 
positives; it has no false negatives with respect to 
the set of possible inputs tried; it identifies actual attack
vectors; and it can generate an execution trace
for any input. However, it is too slow. 


QED combines the advantages of all three approaches.
It uses sound analysis to optimize both dynamic
monitoring and model checking, dynamic monitoring
to follow the flow of taint, and finally model
checking to generate the actual attack vectors. 

Cross-site scripting and SQL injection are examples of
errors that exist at the application layer and that are not 
due to simple language deficiencies like buffer overruns. 
We can expect to see many more varieties of errors that 
operate at this higher semantic level. This suggests that 
programmable systems like bddbddb, PQL, and QED are 
important so that developers can utilize the technology, 
without being analysis experts, for their own programs. 

The widespread adoption of application frameworks 
in software development opens up a new opportunity for 
managing software complexity. These software frame- 
works should come with testing, model checking, static 
analysis, and dynamic monitoring submodules; they 
should be programmable and specialized for that frame- 
work. Perfecting them as part of the framework will put 
these advanced technologies in the hands of many more 
developers. 


Abstract


Contrary to popular assumption, DRAMs used in most 
modern computers retain their contents for several sec- 
onds after power is lost, even at room temperature and 
even if removed from a motherboard. Although DRAMs 
become less reliable when they are not refreshed, they 
are not immediately erased, and their contents persist 
sufficiently for malicious (or forensic) acquisition of us- 
able full-system memory images. We show that this phe- 
nomenon limits the ability of an operating system to pro- 
tect cryptographic key material from an attacker with 
physical access. We use cold reboots to mount successful 
attacks on popular disk encryption systems using no spe- 
cial devices or materials. We experimentally characterize 
the extent and predictability of memory remanence and 
report that remanence times can be increased dramatically 
with simple cooling techniques. We offer new algorithms 
for finding cryptographic keys in memory images and for 
correcting errors caused by bit decay. Though we discuss 
several strategies for partially mitigating these risks, we 
know of no simple remedy that would eliminate them. 


1 Introduction 


Most security experts assume that a computer’s memory 
is erased almost immediately when it loses power, or that 
whatever data remains is difficult to retrieve without spe- 
cialized equipment. We show that these assumptions are 
incorrect. Ordinary DRAMs typically lose their contents 
gradually over a period of seconds, even at standard oper- 
ating temperatures and even if the chips are removed from 
the motherboard, and data will persist for minutes or even 
hours if the chips are kept at low temperatures. Residual 
data can be recovered using simple, nondestructive tech- 
niques that require only momentary physical access to the 
machine. 

We present a suite of attacks that exploit DRAM re- 
manence effects to recover cryptographic keys held in 
memory. They pose a particular threat to laptop users who
rely on disk encryption products, since an adversary who 
steals a laptop while an encrypted disk is mounted could 
employ our attacks to access the contents, even if the com- 
puter is screen-locked or suspended. We demonstrate this 
risk by defeating several popular disk encryption systems, 
including BitLocker, TrueCrypt, and FileVault, and we 
expect many similar products are also vulnerable. 

While our principal focus is disk encryption, any sen- 
sitive data present in memory when an attacker gains 
physical access to the system could be subject to attack. 
Many other security systems are probably vulnerable. For 
example, we found that Mac OS X leaves the user’s lo- 
gin password in memory, where we were able to recover 
it, and we have constructed attacks for extracting RSA 
private keys from Apache web servers. 

As we discuss in Section 2, certain segments of the 
computer security and semiconductor physics communi- 
ties have been conscious of DRAM remanence effects 
for some time, though strikingly little about them has 
been published. As a result, many who design, deploy, or 
rely on secure systems are unaware of these phenomena 
or the ease with which they can be exploited. To our 
knowledge, ours is the first comprehensive study of their 
security consequences. 


Highlights and roadmap In Section 3, we describe
experiments that we conducted to characterize DRAM 
remanence in a variety of memory technologies. Contrary 
to the expectation that DRAM loses its state quickly if 
it is not regularly refreshed, we found that most DRAM 
modules retained much of their state without refresh, and 
even without power, for periods lasting thousands of re- 
fresh intervals. At normal operating temperatures, we 
generally saw a low rate of bit corruption for several sec- 
onds, followed by a period of rapid decay. Newer memory 
technologies, which use higher circuit densities, tended 
to decay more quickly than older ones. In most cases, we 
observed that almost all bits decayed at predictable times 
and to predictable “ground states” rather than to random
values. 

We also confirmed that decay rates vary dramatically 
with temperature. We obtained surface temperatures of 
approximately -50°C with a simple cooling technique:
discharging inverted cans of “canned air” duster spray 
directly onto the chips. At these temperatures, we typi- 
cally found that fewer than 1% of bits decayed even after 
10 minutes without power. To test the limits of this ef- 
fect, we submerged DRAM modules in liquid nitrogen 
(ca. -196°C) and saw decay of only 0.17% after 60 minutes
out of the computer.

In Section 4, we present several attacks that exploit 
DRAM remanence to acquire memory images from which 
keys and other sensitive data can be extracted. Our attacks 
come in three variants, of increasing resistance to coun- 
termeasures. The simplest is to reboot the machine and 
launch a custom kernel with a small memory footprint 
that gives the adversary access to the retained memory. A 
more advanced attack briefly cuts power to the machine, 
then restores power and boots a custom kernel; this de- 
prives the operating system of any opportunity to scrub 
memory before shutting down. An even stronger attack 
cuts the power and then transplants the DRAM modules 
to a second PC prepared by the attacker, which extracts 
their state. This attack additionally deprives the original 
BIOS and PC hardware of any chance to clear the memory 
on boot. We have implemented imaging kernels for use 
with network booting or a USB drive. 

If the attacker is forced to cut power to the memory for 
too long, the data will become corrupted. We propose 
three methods for reducing corruption and for correct- 
ing errors in recovered encryption keys. The first is to 
cool the memory chips prior to cutting power, which dra- 
matically reduces the error rate. The second is to apply 
algorithms we have developed for correcting errors in 
private and symmetric keys. The third is to replicate the 
physical conditions under which the data was recovered 
and experimentally measure the decay properties of each 
memory location; with this information, the attacker can 
conduct an accelerated error correction procedure. These 
techniques can be used alone or in combination. 

In Section 5, we explore the second error correction 
method: novel algorithms that can reconstruct crypto- 
graphic keys even with relatively high bit-error rates. 
Rather than attacking the key directly, our methods con- 
sider values derived from it, such as key schedules, that 
provide a higher degree of redundancy. For performance 
reasons, many applications precompute these values and 
keep them in memory for as long as the key itself is in 
use. To reconstruct an AES key, for example, we treat the 
decayed key schedule as an error correcting code and find 
the most likely values for the original key. Applying this 
method to keys with 10% of bits decayed, we can reconstruct
nearly any 128-bit AES key within a few seconds.
We have devised reconstruction techniques for AES, DES, 
and RSA keys, and we expect that similar approaches 
will be possible for other cryptosystems. The vulnerabil- 
ity of precomputation products to such attacks suggests 
an interesting trade-off between efficiency and security. 
In Section 6, we present fully automatic techniques for 
identifying such keys from memory images, even in the 
presence of bit errors. 

We demonstrate the effectiveness of these attacks in 
Section 7 by attacking several widely used disk encryption 
products, including BitLocker, TrueCrypt, and FileVault. 
We have developed a fully automated demonstration at- 
tack against BitLocker that allows access to the contents 
of the disk with only a few minutes of computation. No- 
tably, using BitLocker with a Trusted Platform Module 
(TPM) sometimes makes it less secure, allowing an attacker
to gain access to the data even if the machine is
stolen while it is completely powered off. 

It may be difficult to prevent all the attacks that we de- 
scribe even with significant changes to the way encryption 
products are designed and used, but in practice there are a 
number of safeguards that can provide partial resistance. 
In Section 8, we suggest a variety of mitigation strategies 
ranging from methods that average users can apply to- 
day to long-term software and hardware changes. Each 
remedy has limitations and trade-offs. As we conclude 
in Section 9, it seems there is no simple fix for DRAM 
remanence vulnerabilities. 


Online resources A video demonstration of our attacks 
and source code for some of our tools are available at 
http://citp.princeton.edu/memory. 


2 Previous Work 


Previous researchers have suggested that data in DRAM 
might survive reboots, and that this fact might have se- 
curity implications. To our knowledge, however, ours is 
the first security study to focus on this phenomenon, the 
first to consider how to reconstruct symmetric keys in the 
presence of errors, the first to apply such attacks to real 
disk encryption systems, and the first to offer a systematic 
discussion of countermeasures. 

We owe the suggestion that modern DRAM contents 
can survive cold boot to Pettersson [33], who seems to 
have obtained it from Chow, Pfaff, Garfinkel, and Rosen- 
blum [13]. Pettersson suggested that remanence across 
cold boot could be used to acquire forensic memory im- 
ages and obtain cryptographic keys, although he did not 
experiment with the possibility. Chow et al. discovered 
this property in the course of an experiment on data lifetime
in running systems. While they did not exploit the
property, they remark on the negative security implications
of relying on a reboot to clear memory.


     Memory Type   Chip Maker   Memory Density   Make/Model            Year
A    SDRAM         Infineon     128Mb            Dell Dimension 4100   1999
B    DDR           Samsung      512Mb            Toshiba Portégé       2001
C    DDR           Micron       256Mb            Dell Inspiron 5100    2003
D    DDR2          Infineon     512Mb            IBM T43p              2006
E    DDR2          Elpida       512Mb            IBM x60               2007
F    DDR2          Samsung      512Mb            Lenovo 3000 N100      2007

Table 1: Test systems we used in our experiments

In a recent presentation, MacIver [31] stated that Microsoft
considered memory remanence attacks in designing
its BitLocker disk encryption system. He acknowledged
that BitLocker is vulnerable to having keys extracted
by cold-booting a machine when it is used in
“basic mode” (where the encrypted disk is mounted automatically
without requiring a user to enter any secrets),
but he asserted that BitLocker is not vulnerable in “advanced
modes” (where a user must provide key material
to access the volume). He also discussed cooling memory
with dry ice to extend the retention time. MacIver
apparently has not published on this subject.

It has been known since the 1970s that DRAM cell 
contents survive to some extent even at room temperature 
and that retention times can be increased by cooling. In 
a 1978 experiment [29], a DRAM showed no data loss 
for a full week without refresh when cooled with liquid 
nitrogen. Anderson [2] briefly discusses remanence in his 
2001 book: 


[A]n attacker can ... exploit ... memory remanence,
the fact that many kinds of computer
memory retain some trace of data that have been
stored there. ... [M]odern RAM chips exhibit
a wide variety of memory remanence behaviors,
with the worst of them keeping data for several
seconds even at room temperature. ...


Anderson cites Skorobogatov [40], who found signifi- 
cant data retention times with static RAMs at room tem- 
perature. Our results for modern DRAMs show even 
longer retention in some cases. 

Anderson’s main focus is on “burn-in” effects that oc- 
cur when data is stored in RAM for an extended period. 
Gutmann [22, 23] also examines “burn-in,” which he at- 
tributes to physical changes that occur in semiconductor 
memories when the same value is stored in a cell for 
a long time. Accordingly, Gutmann suggests that keys 
should not be stored in one memory location for longer 
than several minutes. Our findings concern a different 
phenomenon: the remanence effects we have studied occur
in modern DRAMs even when data is stored only
momentarily. These effects do not result from the kind
of physical changes that Gutmann described, but rather 
from the capacitance of DRAM cells. 

Other methods for obtaining memory images from live 
systems include using privileged software running un- 
der the host operating system [43], or using DMA trans- 
fer on an external bus [19], such as PCI [12], mini-PCI, 
Firewire [8, 15, 16], or PC Card. Unlike these techniques, 
our attacks do not require access to a privileged account 
on the target system, they do not require specialized hard- 
ware, and they are resistant to operating system counter- 
measures. 


3 Characterizing Remanence Effects 


A DRAM cell is essentially a capacitor. Each cell encodes 
a single bit by either charging or not charging one of the 
capacitor’s conductors. The other conductor is hard-wired 
either to power or to ground, depending on the cell’s 
address within the chip [37, 23]. 

Over time, charge will leak out of the capacitor, and the 
cell will lose its state or, more precisely, it will decay to its 
ground state, either zero or one depending on whether the 
fixed conductor of the capacitor is hard-wired to ground or 
power. To forestall this decay, the cell must be refreshed, 
meaning that the capacitor must be re-charged to hold 
its value. Specifications for DRAM chips give a refresh 
time, which is the maximum interval that is supposed to 
pass before a cell is refreshed. The standard refresh time 
(usually on the order of milliseconds) is meant to achieve 
extremely high reliability for normal computer operations 
where even infrequent bit errors could cause serious prob- 
lems; however, a failure to refresh any individual DRAM 
cell within this time has only a tiny probability of actually 
destroying the cell’s contents. 

We conducted a series of experiments to characterize 
DRAM remanence effects and better understand the secu- 
rity properties of modern memories. We performed trials 
using PC systems with different memory technologies, as 
shown in Table 1. These systems included models from 
several manufacturers and ranged in age from 9 years to 
6 months. 


3.1 Decay at operating temperature


Using a modified version of our PXE memory imaging 
program (see Section 4.1), we filled representative mem- 
ory regions with a pseudorandom pattern. We read back 
these memory regions after varying periods of time with- 
out refresh and under different temperature conditions, 
and measured the error rate of each sample. The error 
rate is the number of bit errors in each sample (the Ham- 
ming distance from the pattern we had written) divided by 
the total number of bits we measured. Since our pseudo- 
random test pattern contained roughly equal numbers of 
zeros and ones, we would expect fully decayed memory 
to have an error rate of approximately 50%.
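
The error-rate definition above can be written down in a few lines. The following is a minimal sketch rather than the measurement code used in the experiments: the function name `error_rate`, the fixed seed, and the 4 KB sample size are our own choices.

```python
import random


def error_rate(written: bytes, read_back: bytes) -> float:
    """Fraction of differing bits: Hamming distance divided by total bits."""
    if len(written) != len(read_back):
        raise ValueError("samples must have equal length")
    differing = sum(bin(a ^ b).count("1") for a, b in zip(written, read_back))
    return differing / (8 * len(written))


rng = random.Random(0)  # fixed seed, an arbitrary choice for repeatability
pattern = bytes(rng.getrandbits(8) for _ in range(4096))
all_zero = bytes(len(pattern))  # a region that decayed entirely to ground state 0

print(error_rate(pattern, pattern))   # 0.0: nothing has decayed
print(error_rate(pattern, all_zero))  # close to 0.5 for a balanced pattern
```

A fully decayed region reads as its ground state everywhere, so about half of the bits of a balanced pseudorandom pattern disagree with it.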

Our first tests measured the decay rate of each mem- 
ory module under normal operating temperature, which 
ranged from 25.5°C to 44.1°C, depending on the ma- 
chine (see Figures 1, 2, and 3). We found that the dimen- 
sions of the decay curves varied considerably between 
machines, with the fastest exhibiting complete data loss 
in approximately 2.5 seconds and the slowest taking an 
average of 35 seconds. However, the decay curves all dis- 
play a similar shape, with an initial period of slow decay, 
followed by an intermediate period of rapid decay, and 
then a final period of slow decay. 

We calculated best fit curves to the data using the logis- 
tic function because MOSFETs, the basic components of 
a DRAM cell, exhibit a logistic decay curve. We found 
that machines using newer memory technologies tend to 
exhibit a shorter time to total decay than machines using 
older memory technologies, but even the shorter times 
are long enough to facilitate most of our attacks. We as- 
cribe this trend to the increasing density of the DRAM 
cells as the technology improves; in general, memory
with higher density has a shorter window in which data
is recoverable. While this trend might make DRAM re- 
tention attacks more difficult in the future, manufacturers 
also generally seek to increase retention times, because 
DRAMs with long retention require less frequent refresh 
and have lower power consumption. 
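
The slow, rapid, then slow shape described above is what a logistic curve produces. The sketch below shows one possible parameterization; the 50% plateau comes from the expected error rate of fully decayed memory, while the parameter names and the midpoint and steepness values are hypothetical and are not fitted to the measurements.

```python
import math


def logistic_error(t: float, midpoint: float, steepness: float,
                   plateau: float = 0.5) -> float:
    """Modeled error rate after t seconds without refresh."""
    return plateau / (1.0 + math.exp(-steepness * (t - midpoint)))


# Hypothetical parameters, chosen only to show the shape of the curve.
for t in (0, 5, 10, 15, 20):
    print(t, round(logistic_error(t, midpoint=10.0, steepness=0.8), 4))
```

In this form the error rate stays near zero at first, passes through half of the plateau at the midpoint, and then levels off near 50%.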


3.2 Decay at reduced temperature 


It has long been known that low temperatures can signifi- 
cantly increase memory devices’ retention times [29, 2, 
46, 23, 41, 40]. To measure this effect, we performed a 
second series of tests using machines A-D. 

In each trial, we loaded a pseudorandom test pattern 
into memory, and, with the computer running, cooled 
the memory module to approximately -50°C. We then
powered off the machine and maintained this temperature 
until power was restored. We achieved these temperatures 
using commonly available “canned air” duster products
(see Section 4.2), which we discharged, with the can 
inverted, directly onto the chips. 

Figure 2: Machines B and F

Figure 3: Machines D and E


     Seconds      Error % at        Error %
     w/o power    operating temp.   at -50°C
A    60           41                (no errors)
     300          50                0.000095
B    360          50                (no errors)
     600          50                0.000036
C    120          41                0.00105
     360          42                0.00144
D    40           50                0.025
     80           50                0.18

Table 2: Effect of cooling on error rates


As expected, we observed a significantly lower rate 
of decay under these reduced temperatures (see Table 2). 
On all of our sample DRAMs, the decay rates were low 
enough that an attacker who cut power for 60 seconds 
would recover 99.9% of bits correctly. 

As an extreme test of memory cooling, we performed 
another experiment using liquid nitrogen as an additional 
cooling agent. We first cooled the memory module of 
Machine A to -50°C using the “canned air” product.
We then cut power to the machine, and quickly removed 
the DRAM module and placed it in a canister of liquid 
nitrogen. We kept the memory module submerged in the 
liquid nitrogen for 60 minutes, then returned it to the 
machine. We measured only 14,000 bit errors within a 1 
MB test region (0.17% decay). This suggests that, even 
in modern memory modules, data may be recoverable for 
hours or days with sufficient cooling. 
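
The quoted decay percentage can be checked from the bit-error count. This check assumes that 1 MB means 2**20 bytes, which the text does not state explicitly.

```python
bit_errors = 14_000
region_bits = 8 * 2**20  # assumption: 1 MB = 2**20 bytes = 8,388,608 bits

decay_percent = 100 * bit_errors / region_bits
print(f"{decay_percent:.2f}%")  # 0.17%
```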


3.3 Decay patterns and predictability


We observed that the DRAMs we studied tended to decay 
in highly nonuniform patterns. While these patterns var- 
ied from chip to chip, they were very predictable in most 
of the systems we tested. Figure 4 shows the decay in 
one memory region from Machine A after progressively 
longer intervals without power. 

There seem to be several components to the decay 
patterns. The most prominent is a gradual decay to the 
“ground state” as charge leaks out of the memory cells. In 
the DRAM shown in Figure 4, blocks of cells alternate 
between a ground state of 0 and a ground state of 1, result- 
ing in the series of horizontal bars. Other DRAM models 
and other regions within this DRAM exhibited different 
ground states, depending on how the cells are wired. 

We observed a small number of cells that deviated from 
the “ground state” pattern, possibly due to manufacturing 
variation. In experiments with 20 or 40 runs, a few “retrograde”
cells (typically ~0.05% of memory cells, but
larger in a few devices) always decayed to the opposite
value of the one predicted by the surrounding ground state
pattern. An even smaller number of cells decayed in different
directions across runs, with varying probabilities.


Apart from their eventual states, the order in which 
different cells decayed also appeared to be highly pre- 
dictable. At a fixed temperature, each cell seems to decay 
after a consistent length of time without power. The rel- 
ative order in which the cells decayed was largely fixed, 
even as the decay times were changed by varying the 
temperature. This may also be a result of manufacturing 
variations, which result in some cells leaking charge faster 
than others. 
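
The behavior described above suggests a very simple per-cell model: each cell holds its value until a fixed decay time has elapsed and then reads as its ground state. The sketch below illustrates that model only; the function names and the four-cell example values are hypothetical and do not come from our measurements.

```python
def decayed_bit(original: int, ground_state: int,
                decay_time: float, elapsed: float) -> int:
    """A cell keeps its value until its decay time passes, then reads as its ground state."""
    return ground_state if elapsed >= decay_time else original


def decay_region(bits, ground_states, decay_times, elapsed):
    """Apply the per-cell model to every cell in a region."""
    return [decayed_bit(b, g, d, elapsed)
            for b, g, d in zip(bits, ground_states, decay_times)]


# Hypothetical four-cell region; the decay times are illustrative only.
bits = [1, 0, 1, 0]
ground_states = [0, 0, 1, 1]
decay_times = [2.0, 5.0, 3.0, 8.0]  # seconds at some fixed temperature

for elapsed in (0, 4, 10):
    print(elapsed, decay_region(bits, ground_states, decay_times, elapsed))
```

Under this model a cell whose stored value already equals its ground state never shows an error, and every other cell changes value exactly once.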


To visualize this effect, we captured degraded memory 
images, including those shown in Figure 4, after cutting 
power for intervals ranging from 1 second to 5 minutes, 
in 1 second increments. We combined the results into a
video (available on our web site). Each test interval began 
with the original image freshly loaded into memory. We 
might have expected to see a large amount of variation 
between frames, but instead, most bits appear stable from 
frame to frame, switching values only once, after the 
cell’s decay interval. The video also shows that the decay 
intervals themselves follow higher order patterns, likely 
related to the physical geometry of the DRAM. 


3.4 BIOS footprints and memory wiping 


Even if memory contents remain intact while power is 
off, the system BIOS may overwrite portions of memory 
when the machine boots. In the systems we tested, the 
BIOS overwrote only relatively small fractions of memory 
with its own code and data, typically a few megabytes 
concentrated around the bottom of the address space. 


On many machines, the BIOS can perform a destructive 
memory check during its Power-On Self Test (POST). 
Most of the machines we examined allowed this test to be 
disabled or bypassed (sometimes by enabling an option 
called “Quick Boot”).


On other machines, mainly high-end desktops and 
servers that support ECC memory, we found that the 
BIOS cleared memory contents without any override op- 
tion. ECC memory must be set to a known state to avoid 
spurious errors if memory is read without being initial- 
ized [6], and we believe many ECC-capable systems per- 
form this wiping operation whether or not ECC memory 
is installed. 


ECC DRAMs are not immune to retention effects, and 
an attacker could transfer them to a non-ECC machine 
that does not wipe its memory on boot. Indeed, ECC 
memory could turn out to help the attacker by making 
DRAM more resistant to bit errors.

Figure 4: We loaded a bitmap image into memory on Machine A, then cut power for varying lengths of time. After 5
seconds (left), the image is indistinguishable from the original. It gradually becomes more degraded, as shown after
30 seconds, 60 seconds, and 5 minutes.


4 Imaging Residual Memory 


Imaging residual memory contents requires no special 
equipment. When the system boots, the memory con- 
troller begins refreshing the DRAM, reading and rewriting 
each bit value. At this point, the values are fixed, decay 
halts, and programs running on the system can read any 
data present using normal memory-access instructions. 


4.1 Imaging tools 


One challenge is that booting the system will necessarily 
overwrite some portions of memory. Loading a full oper- 
ating system would be very destructive. Our approach is 
to use tiny special-purpose programs that, when booted 
from either a warm or cold reset state, produce accurate 
dumps of memory contents to some external medium. 
These programs use only trivial amounts of RAM, and
the memory offsets they use can be adjusted to some extent
to ensure that data structures of interest are unaffected.

Our memory-imaging tools make use of several differ- 
ent attack vectors to boot a system and extract the contents 
of its memory. For simplicity, each saves memory images 
to the medium from which it was booted. 


PXE network boot Most modern PCs support net- 
work booting via Intel’s Preboot Execution Environment 
(PXE) [25], which provides rudimentary startup and net- 
work services. We implemented a tiny (9 KB) standalone 
application that can be booted via PXE and whose only 
function is streaming the contents of system RAM via 
a UDP-based protocol. Since PXE provides a universal 
API for accessing the underlying network hardware, the 
same binary image will work unmodified on any PC sys-
tem with PXE support. In a typical attack setup, a laptop
connected to the target machine via an Ethernet crossover
cable runs DHCP and TFTP servers as well as a simple 
client application for receiving the memory data. We have 
extracted memory images at rates up to 300 Mb/s (around 
30 seconds for a 1 GB RAM) with gigabit Ethernet cards. 


USB drives Alternatively, most PCs can boot from an 
external USB device such as a USB hard drive or flash 
device. We implemented a small (10 KB) plug-in for the 
SYSLINUX bootloader [3] that can be booted from an 
external USB device or a regular hard disk. It saves the 
contents of system RAM into a designated data partition 
on this device. We succeeded in dumping 1 GB of RAM
to a flash drive in approximately 4 minutes. 


EFI boot Some recent computers, including all Intel- 
based Macintosh computers, implement the Extensible 
Firmware Interface (EFI) instead of a PC BIOS. We have 
also implemented a memory dumper as an EFI netboot 
application. We have achieved memory extraction speeds 
up to 136 Mb/s, and we expect it will be possible to 
increase this throughput with further optimizations. 


iPods We have installed memory imaging tools on an 
Apple iPod so that it can be used to covertly capture 
memory dumps without impacting its functionality as a 
music player. This provides a plausible way to conceal 
the attack in the wild. 


4.2 Imaging attacks 


An attacker could use imaging tools like ours in a number 
of ways, depending on his level of access to the system 
and the countermeasures employed by hardware and soft-
ware.

Figure 5: Before powering off the computer, we spray an upside-down canister of multipurpose duster directly onto the
memory chips, cooling them to −50°C. At this temperature, the data will persist for several minutes after power loss
with minimal error, even if we remove the DIMM from the computer.


Simple reboots The simplest attack is to reboot the 
machine and configure the BIOS to boot the imaging 
tool. A warm boot, invoked with the operating system’s 
restart procedure, will normally ensure that the memory 
has no chance to decay, though software will have an 
opportunity to wipe sensitive data prior to shutdown. A 
cold boot, initiated using the system’s restart switch or by 
briefly removing and restoring power, will result in little 
or no decay depending on the memory’s retention time. 
Restarting the system in this way denies the operating 
system and applications any chance to scrub memory 
before shutting down. 


Transferring DRAM modules Even if an attacker can-
not force a target system to boot memory-imaging tools,
or if the target employs countermeasures that erase mem- 
ory contents during boot, DIMM modules can be phys- 
ically removed and their contents imaged using another 
computer selected by the attacker. 

Some memory modules exhibit far faster decay than 
others, but as we discuss in Section 3.2 above, cooling a 
module before powering it off can slow decay sufficiently 
to allow it to be transferred to another machine with mini-
mal decay. Widely-available “canned air” dusters, usually
containing a compressed fluorohydrocarbon refrigerant, 
can easily be used for this purpose. When the can is dis- 
charged in an inverted position, as shown in Figure 5, it 
dispenses its contents in liquid form instead of as a gas. 
The rapid drop in pressure inside the can lowers the tem- 
perature of the discharge, and the subsequent evaporation 
of the refrigerant causes a further chilling. By spraying 
the contents directly onto memory chips, we can cool their 
surfaces to −50°C and below. If the DRAM is cooled to
this temperature before power is cut and kept cold, we 
can achieve nearly lossless data recovery even after the 
chip is out of the computer for several minutes. 

Removing the memory modules can also allow the 
attacker to image memory in address regions where stan-
dard BIOSes load their own code during boot. The at-
tacker could remove the primary memory module from
the target machine and place it into the secondary DIMM
slot (in the same machine or another machine), effectively 
remapping the data to be imaged into a different part of 
the address space. 


5 Key Reconstruction 


Our experiments (see Section 3) show that it is possible 
to recover memory contents with few bit errors even af- 
ter cutting power to the system for a brief time, but the 
presence of even a small amount of error complicates 
the process of extracting correct cryptographic keys. In 
this section we present algorithms for correcting errors 
in symmetric and private keys. These algorithms can cor- 
rect most errors quickly even in the presence of relatively 
high bit error probabilities in the range of 5% to 50%, 
depending on the type of key. 

A naive approach to key error correction is to brute- 
force search over keys with a low Hamming distance from 
the decayed key that was retrieved from memory, but this 
is computationally burdensome even with a moderate 
amount of unidirectional error. As an example, if only 
10% of the ones have decayed to zeros in our memory 
image, the data recovered from a 256-bit key with an equal 
number of ones and zeroes has an expected Hamming 
distance of 12 from the actual key, and the number of 
such keys is $\binom{128+12}{12} > 2^{56}$.
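
The size of this search space can be checked with a few lines of Python. The sketch below assumes, as in the example, a 256-bit key with 128 ones of which 10% decay (about 12 bits). The recovered key then holds about 128 + 12 zeros, and a brute-force search must guess which 12 of them were originally ones. All variable names are illustrative.

```python
from math import comb, log2

KEY_BITS = 256
ONES = KEY_BITS // 2        # assumption: equal numbers of ones and zeros
DECAY = 0.10                # fraction of ones that have decayed to zero

flipped = int(ONES * DECAY)                  # about 12 decayed bits
zeros_seen = (KEY_BITS - ONES) + flipped     # zeros in the recovered key

# Number of ways to choose which recovered zeros were originally ones.
candidates = comb(zeros_seen, flipped)
print(flipped, zeros_seen, round(log2(candidates), 1))
```

Even at this modest error rate the count exceeds the size of a 56-bit keyspace, which is why the search is described as burdensome.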

Our algorithms achieve significantly better perfor- 
mance by considering data other than the raw form of 
the key. Most encryption programs speed up computation 
by storing data precomputed from the encryption keys— 
for block ciphers, this is most often a key schedule, with 
subkeys for each round; for RSA, this is an extended form 
of the private key which includes the primes p and q and 
several other values derived from d. This data contains 
much more structure than the key itself, and we can use 
this structure to efficiently reconstruct the original key 
even in the presence of errors. 

These results imply an interesting trade-off between
efficiency and security. All of the disk encryption systems
we studied (see Section 7) precompute key schedules and 
keep them in memory for as long as the encrypted disk is 
mounted. While this practice saves some computation for 
each disk block that needs to be encrypted or decrypted, 
we find that it greatly simplifies key recovery attacks. 

Our approach to key reconstruction has the advantage 
that it is completely self-contained, in that we can recover 
the key without having to test the decryption of cipher- 
text. The data derived from the key, and not the decoded 
plaintext, provides a certificate of the likelihood that we 
have found the correct key. 

We have found it useful to adopt terminology from 
coding theory. We may imagine that the expanded key 
schedule forms a sort of error correcting code for the key, 
and the problem of reconstructing a key from memory 
may be recast as the problem of finding the closest code 
word (valid key schedule) to the data once it has been 
passed through a channel that has introduced bit errors. 


Modeling the decay Our experiments showed that al- 
most all memory bits tend to decay to predictable ground 
states, with only a tiny fraction flipping in the opposite 
direction. In describing our algorithms, we assume, for 
simplicity, that all bits decay to the same ground state. 
(They can be implemented without this requirement, as- 
suming that the ground state of each bit is known.) 

If we assume we have no knowledge of the decay pat- 
terns other than the ground state, we can model the de- 
cay with the binary asymmetric channel, in which the 
probability of a 1 flipping to 0 is some fixed $\delta_0$ and the
probability of a 0 flipping to a 1 is some fixed $\delta_1$.

In practice, the probability of decaying to the ground
state approaches 1 as time goes on, while the probabil-
ity of flipping in the opposite direction remains relatively
constant and tiny (less than 0.1% in our tests). The ground
state decay probability can be approximated from recov-
ered key data by counting the fraction of 1s and 0s, as-
suming that the original key data contained roughly equal
proportions of each value.
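
A minimal Python sketch of this estimate follows. It assumes a ground state of 0 and original data with equal proportions of ones and zeros, so the expected fraction of ones after decay is (1 − δ0)/2 + δ1/2. The function names and the simulated data are illustrative, not part of the tools described in this paper.

```python
import random

def estimate_delta0(recovered: bytes, delta1: float = 0.0) -> float:
    """Estimate the 1 -> 0 decay probability from the fraction of ones.

    Assumes ground state 0 and balanced original data.
    """
    total_bits = 8 * len(recovered)
    ones = sum(bin(b).count("1") for b in recovered)
    # Solve ones/total = (1 - delta0)/2 + delta1/2 for delta0.
    estimate = 1.0 - 2.0 * ones / total_bits + delta1
    return min(1.0, max(0.0, estimate))

def simulate_decay(data: bytes, delta0: float, rng: random.Random) -> bytes:
    """Hypothetical helper: clear each 1 bit independently with probability delta0."""
    out = bytearray(data)
    for i in range(len(out)):
        for bit in range(8):
            if ((out[i] >> bit) & 1) and rng.random() < delta0:
                out[i] &= ~(1 << bit) & 0xFF
    return bytes(out)

rng = random.Random(1)
original = bytes(rng.getrandbits(8) for _ in range(8192))
decayed = simulate_decay(original, 0.30, rng)
print(round(estimate_delta0(decayed), 2))
```

The estimate is only as good as the balance assumption; for short keys the sampling error can be several percentage points.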

We also observed that bits tended to decay in a pre- 
dictable order that could be learned over a series of timed 
decay trials, although the actual order of decay appeared 
fairly random with respect to location. An attacker with 
the time and physical access to run such a series of tests 
could easily adapt any of the approaches in this section to 
take this order into account and improve the performance 
of the error-correction. Ideally such tests would be able to 
replicate the conditions of the memory extraction exactly, 
but knowledge of the decay order combined with an esti- 
mate of the fraction of bit flips is enough to give a very 
good estimate of an individual decay probability of each 
bit. This probability can be used in our reconstruction 
algorithms to prioritize guesses.
For simplicity and generality, we will analyze the algo-
rithms assuming no knowledge of this decay order. 


5.1 Reconstructing DES keys 


We first apply these methods to develop an error correc- 
tion technique for DES. The DES key schedule algorithm 
produces 16 subkeys, each a permutation of a 48-bit sub- 
set of bits from the original 56-bit key. Every bit from the 
original key is repeated in about 14 of the 16 subkeys. 

In coding theory terms, we can treat the DES key sched- 
ule as a repetition code: the message is a single bit, and 
the corresponding codeword is a sequence of n copies of 
this bit. If $\delta_0 = \delta_1 < \frac{1}{2}$, the optimal decoding of such an
n-bit codeword is 0 if more than n/2 of the recovered bits
are 0, and 1 otherwise. For $\delta_0 \neq \delta_1$, the optimal decod-
ing is 0 if more than nr of the recovered bits are 0 and 1
otherwise, where

$$r = \frac{\log(1 - \delta_0) - \log \delta_1}{\log(1 - \delta_0) + \log(1 - \delta_1) - \log \delta_1 - \log \delta_0}.$$

For $\delta_0 = .1$ and $\delta_1 = .001$ (that is, we are in a block
with ground state 0), r = .75 and this approach will fail to
correctly decode a bit only if more than 3 of the 14 copies
of a 0 decay to a 1, or more than 11 of the 14 copies of
a 1 decay to 0. The probability of this event is less than
$10^{-9}$. Applying the union bound, the probability that any
of the 56 key bits will be incorrectly decoded is at most
$56 \times 10^{-9} < 6 \times 10^{-8}$; even at 50% error, the probability
that the key can be correctly decoded without resorting to
brute force search is more than 98%.
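
The decoding rule can be sketched in Python as follows. The threshold is the likelihood-ratio cutoff for n noisy copies of one bit sent through a binary asymmetric channel, where δ0 is the probability that a 1 decays to 0 and δ1 the probability that a 0 flips to 1. The function names are illustrative, and the sketch decodes a single key bit rather than a full DES schedule.

```python
from math import log

def threshold(delta0: float, delta1: float) -> float:
    """Fraction r of zeros above which the most likely original bit is 0."""
    num = log(1 - delta0) - log(delta1)
    den = log(1 - delta0) + log(1 - delta1) - log(delta1) - log(delta0)
    return num / den

def decode_bit(copies: list, delta0: float, delta1: float) -> int:
    """Decode one key bit from its recovered copies (a list of 0/1 values)."""
    zeros = copies.count(0)
    return 0 if zeros > len(copies) * threshold(delta0, delta1) else 1

r = threshold(0.1, 0.001)
print(round(r, 2))
# Eleven zeros among fourteen copies decode to 0; ten zeros decode to 1.
print(decode_bit([0] * 11 + [1] * 3, 0.1, 0.001))
print(decode_bit([0] * 10 + [1] * 4, 0.1, 0.001))
```

With a ground state of 0, a surviving 1 is strong evidence for an original 1, which is why the cutoff sits well above one half.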

This technique can be trivially extended to correct er- 
rors in Triple DES keys. Since Triple DES applies the 
same key schedule algorithm to two or three 56-bit key 
components (depending on the version of Triple DES), 
the probability of correctly decoding each key bit is the 
same as for regular DES. With a decay rate of $\delta_0 = .5$ and
probability $\delta_1 = .001$ of bit flips in the opposite direction,
we can correctly decode a 112-bit Triple DES key with at 
least 97% probability and a 168-bit key with at least 96% 
probability. 


5.2 Reconstructing AES keys 


The AES key schedule has a more complex structure than 
the DES key schedule, but we can still use it to efficiently 
reconstruct a key in the presence of errors. 

A seemingly reasonable approach to this problem 
would be to search keys in order of distance to the recov- 
ered key and output any key whose schedule is sufficiently 
close to the recovered schedule. Our implementation of 
this algorithm took twenty minutes to search $10^9$ candi-
date keys in order to reconstruct a key in which 7 zeros
had flipped to ones. At this rate it would take ten days to
reconstruct a key with 11 bits flipped.

Figure 6: In the 128-bit AES key schedule, three bytes of
each round key are entirely determined by four bytes of
the preceding round key.

We can do significantly better by taking advantage of 
the structure of the AES key schedule. Instead of trying 
to correct an entire key at once, we can examine a smaller 
set of bytes at a time. The high amount of linearity in the 
key schedule is what permits this separability—we can 
take advantage of pieces that are small enough to brute 
force optimal decodings for, yet large enough that these 
decodings are useful to reconstruct the overall key. Once 
we have a list of possible decodings for these smaller 
pieces of the key in order of likelihood, we can combine 
them into a full key to check against the key schedule. 

Since each of the decoding steps is quite fast, the run- 
ning time of the entire algorithm is ultimately limited 
by the number of combinations we need to check. The 
number of combinations is still roughly exponential in the 
number of errors, but it is a vast improvement over brute 
force searching and is practical in many realistic cases. 


Overview of the algorithm For 128-bit keys, an AES 
key expansion consists of 11 four-word (128-bit) round 
keys. The first round key is equal to the key itself. Each 
remaining word of the key schedule is generated either 
by XORing two words of the key schedule together, or by 
performing the key schedule core (in which the bytes of a 
word are rotated and each byte is mapped to a new value) 
on a word of the key schedule and XORing the result with 
another word of the key schedule. 

Consider a “slice” of the first two round keys consisting
of byte i from words 1-3 of the first two round keys, and
byte i − 1 from word 4 of the first round key (as shown
in Figure 6). This slice is 7 bytes long, but is uniquely
determined by the four bytes from the key. In theory,
there are still $2^{32}$ possibilities to examine for each slice,
but we can do quite well by examining them in order of
distance to the recovered key. For each possible set of 4
key bytes, we generate the relevant three bytes of the next
round key and calculate the probability, given estimates
of $\delta_0$ and $\delta_1$, that these seven bytes might have decayed
to the corresponding bytes of the recovered round keys.
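
The probability computation in this step can be sketched as a log-likelihood under the binary asymmetric channel. The Python below is illustrative only: it scores an arbitrary candidate byte string against the recovered bytes, and it leaves out the AES-specific step of deriving the three next-round bytes from the four key bytes. The function name is an assumption.

```python
from math import log

def decay_log_likelihood(candidate: bytes, recovered: bytes,
                         delta0: float, delta1: float) -> float:
    """Log-probability that `candidate` decayed into `recovered`.

    delta0: probability that a 1 decays to 0.
    delta1: probability that a 0 flips to 1.
    """
    assert len(candidate) == len(recovered)
    total = 0.0
    for c, r in zip(candidate, recovered):
        for bit in range(8):
            cb, rb = (c >> bit) & 1, (r >> bit) & 1
            if cb == 1:
                total += log(1 - delta0) if rb == 1 else log(delta0)
            else:
                total += log(1 - delta1) if rb == 0 else log(delta1)
    return total

recovered = bytes([0b11110000])
with_decay = bytes([0b11111000])   # explains the data with one 1 -> 0 decay
with_flip = bytes([0b11100000])    # needs one rare 0 -> 1 flip
candidates = sorted([with_flip, with_decay],
                    key=lambda c: decay_log_likelihood(c, recovered, 0.1, 0.001),
                    reverse=True)
print(candidates[0] == with_decay)
```

Both candidates are at Hamming distance 1 from the recovered byte, but the one consistent with decay toward the ground state ranks first, which is the ordering the search relies on.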

Now we proceed to guess candidate keys, where a
candidate contains a value for each slice of bytes. We
consider the candidates in order of decreasing total like- 
lihood as calculated above. For each candidate key we 
consider, we calculate the expanded key schedule and ask 
if the likelihood of that expanded key schedule decaying 
to our recovered key schedule is sufficiently high. If so, 
then we output the corresponding key as a guess. 

When one of $\delta_0$ or $\delta_1$ is very small, this algorithm will
almost certainly output a unique guess for the key. To see 
this, observe that a single bit flipped in the key results in 
a cascade of bit flips through the key schedule, half of 
which are likely to flip in the “wrong” direction. 


Our implementation of this algorithm is able to re-
construct keys with 15% error (that is, $\delta_0 = .15$ and
$\delta_1 = .001$) in a fraction of a second, and about half of
keys with 30% error within 30 seconds.

This idea can be extended to 256-bit keys by dividing
the words of the key into two sections (words 1-3 and 8,
and words 4-7, for example), then comparing the words
of the third and fourth round keys generated by the bytes
of these words and combining the result into candidate
round keys to check.


5.3 Reconstructing tweak keys


The same methods can be applied to reconstruct keys for 
tweakable encryption modes [30], which are commonly 
used in disk encryption systems. 


LRW LRW augments a block cipher E (and key $K_1$) by
computing a “tweak” X for each data block and encrypt-
ing the block using the formula $E_{K_1}(P \oplus X) \oplus X$. A tweak
key $K_2$ is used to compute the tweak, $X = K_2 \otimes I$, where
I is the logical block identifier. The operations $\oplus$ and $\otimes$
are performed in the finite field $GF(2^{128})$.

In order to speed tweak computations, implementations
commonly precompute multiplication tables of the values
$K_2 x^i \bmod P$, where x is the primitive element and P is an
irreducible polynomial over $GF(2^{128})$ [26]. In practice,
$Qx \bmod P$ is computed by shifting the bits of Q left by
one and possibly XORing with P.

Given a value $K_2 x^i$, we can recover nearly all of the
bits of $K_2$ simply by shifting right by i. The number of
bits lost depends on i and the nonzero elements of P. An
entire multiplication table will contain many copies of
nearly all of the bits of $K_2$, allowing us to reconstruct the
key in much the same way as the DES key schedule.
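
The shift-right recovery can be illustrated with Python integers. This sketch assumes the reduction polynomial x^128 + x^7 + x^2 + x + 1 and treats bit k of an integer as the coefficient of x^k; real implementations may order bits and bytes differently. The function names are hypothetical. The mask is conservative: it discards the low 8 bits, which the reduction may have touched, and the top i bits, which were shifted out.

```python
MASK128 = (1 << 128) - 1
LOW_POLY = 0x87  # x^7 + x^2 + x + 1, the low terms of the reduction polynomial

def mul_x(q: int) -> int:
    """Multiply by x in GF(2^128): shift left, then reduce if bit 128 is set."""
    q <<= 1
    if q >> 128:
        q = (q & MASK128) ^ LOW_POLY
    return q

def mul_x_pow(k: int, i: int) -> int:
    """Compute k * x^i by repeated shift-and-XOR."""
    for _ in range(i):
        k = mul_x(k)
    return k

def recover_bits(value: int, i: int):
    """Return (bits, mask): the bits of K that can be trusted in K * x^i."""
    mask = 0
    for j in range(8, 128 - i):
        mask |= 1 << j
    return (value >> i) & mask, mask

key = 0x0123456789ABCDEFFEDCBA9876543210
entry = mul_x_pow(key, 5)
bits, mask = recover_bits(entry, 5)
print(bits == key & mask, bin(mask).count("1"))
```

Each table entry yields a partial copy of the key, and the copies from different values of i can then be combined bit by bit, for example by the weighted vote used for a repetition code.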

As an example, we apply this method to reconstruct the 
LRW key used by the TrueCrypt 4 disk encryption system. 
TrueCrypt 4 precomputes a 4096-byte multiplication table
consisting of 16 blocks of 16 lines of 4 words of 4 bytes 
each. Line 0 of block 14 contains the tweak key. 

The multiplication table is generated line by line from
the LRW key by iteratively applying the shift-and-XOR
multiply function to generate four new values, and then
XORing all combinations of these four values to create
16 more lines of the table. The shift-and-XOR operation
is performed 64 times to generate the table, using the
irreducible polynomial $P = x^{128} + x^7 + x^2 + x + 1$. For any
of these 64 values, we can shift right i times to recover
128 − (8 + i) of the bits of $K_2$, and use these recovered
values to reconstruct $K_2$ with high probability.


XEX and XTS For XEX [35] and XTS [24] modes, the
tweak for block j in sector I is $X = E_{K_2}(I) \otimes x^j$, where
I is encrypted with AES and x is the primitive element
of $GF(2^{128})$. Assuming the key schedule for $K_2$ is kept
in memory, we can use the AES key reconstruction tech-
niques to reconstruct the tweak key.


5.4 Reconstructing RSA private keys 


An RSA public key consists of the modulus N and the 
public exponent e, while the private key consists of the 
private exponent d and several optional values: prime fac-
tors p and q of N, d mod (p − 1), d mod (q − 1), and
$q^{-1} \bmod p$. Given N and e, any of the private values is
sufficient to generate the others using the Chinese remain- 
der theorem and greatest common divisor algorithms. In 
practice, RSA implementations store some or all optional 
values to speed up computation. 
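
As a small illustration of how these values relate, the Python sketch below derives the remaining private values from one prime factor. It covers only the easy direction (starting from p), uses toy-sized numbers, and requires Python 3.8 or later for modular inverses via pow. The function name and dictionary keys are illustrative.

```python
from math import gcd

def rsa_private_values(n: int, e: int, p: int) -> dict:
    """Derive d and the optional CRT values from the modulus, e, and one prime."""
    assert n % p == 0
    q = n // p
    phi = (p - 1) * (q - 1)
    assert gcd(e, phi) == 1
    d = pow(e, -1, phi)                 # private exponent
    return {
        "q": q,
        "d": d,
        "dp": d % (p - 1),              # d mod (p - 1)
        "dq": d % (q - 1),              # d mod (q - 1)
        "qinv": pow(q, -1, p),          # q^-1 mod p
    }

values = rsa_private_values(n=3233, e=17, p=61)
print(values)
```

Because every stored value is a deterministic function of the others, each additional value kept in memory gives an attacker more redundancy to correct errors against.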

There have been a number of results on efficiently re- 
constructing RSA private keys given a fraction of the bits 
of private key data. Let n = lg N. N can be factored in
polynomial time given the n/4 least significant bits of p
(Coppersmith [14]), given the n/4 least significant bits of
d (Boneh, Durfee, and Frankel [9]), or given the n/4 least
significant bits of d mod (p − 1) (Blömer and May [7]).

These previous results are all based on Coppersmith’s 
method of finding bounded solutions to polynomial equa- 
tions using lattice basis reduction; the number of contigu- 
ous bits recovered from the most or least significant bits of 
the private key data determines the additive error tolerated 
in the solution. In our case, the errors may be distributed 
across all bits of the key data, so we are searching for 
solutions with low Hamming weight, and these previous 
approaches do not seem to be directly applicable. 

Given the public modulus N and the values $\tilde{p}$ and $\tilde{q}$
recovered from memory, we can deduce values for the
original p and q by iteratively reconstructing them from
the least-significant bits. For unidirectional decay with
probability $\delta$, bits $p_i$ and $q_i$ are uniquely determined by
$N_i$ and our guesses for the i − 1 lower-order bits of p and q
(observe that $p_0 = q_0 = 1$), except in the case when $\tilde{p}_i$ and
$\tilde{q}_i$ are both in the ground state. This yields a branching
process with expected degree $\frac{(3+\delta)^2}{8}$. If decay is not
unidirectional, we may use the estimated probabilities to
weight the branches at each bit.
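
A simplified version of this branching process can be sketched in Python. The sketch assumes strictly unidirectional decay toward 0, so any 1 bit that survives in the recovered values must be a 1 in the original, and it extends partial factors one bit at a time while keeping only those whose product agrees with N on the low bits. It omits likelihood ordering and pruning heuristics, and it uses toy-sized primes; the function name is illustrative.

```python
def reconstruct_factors(n: int, p_tilde: int, q_tilde: int, nbits: int):
    """Depth-first search for p and q from the least-significant bit upward."""
    stack = [(1, 1, 1)]  # (partial p, partial q, next bit index); p0 = q0 = 1
    while stack:
        p, q, i = stack.pop()
        if i == nbits:
            if p * q == n:
                return p, q
            continue
        mod = 1 << (i + 1)
        for pb in (0, 1):
            if ((p_tilde >> i) & 1) and pb == 0:
                continue  # a surviving 1 cannot have been a 0
            for qb in (0, 1):
                if ((q_tilde >> i) & 1) and qb == 0:
                    continue
                cp, cq = p | (pb << i), q | (qb << i)
                # Keep the branch only if it matches N modulo 2^(i+1).
                if (cp * cq) % mod == n % mod:
                    stack.append((cp, cq, i + 1))
    return None

P, Q = 1000000007, 1000000009           # toy primes, 30 bits each
p_decayed = P & ~((1 << 9) | (1 << 20) | (1 << 28))
q_decayed = Q & ~((1 << 11) | (1 << 24))
print(reconstruct_factors(P * Q, p_decayed, q_decayed, 30))
```

The search only branches at positions where both recovered bits are in the ground state, which is why the tree stays small at moderate decay rates.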




Combined with a few heuristics—for example, choose 
the most likely state first, prune nodes by bounds on the 
solution, and iteratively increase the bit flips allowed— 
this results in a practical algorithm for reasonable error 
rates. This process can likely be improved substantially 
using additional data recovered from the private key. 

We tested an implementation of the algorithm on a fast
modern machine. For fifty trials with 1024-bit primes
(2048-bit keys) and $\delta$ = 4%, the median reconstruction
time was 4.5 seconds. The median number of nodes vis-
ited was 16,499, the mean was 621,707, and the standard
deviation was 2,136,870. For $\delta$ = 6%, reconstruction re-
quired a median of 2.5 minutes, or 227,763 nodes visited.

For 512-bit primes and $\delta$ = 10%, reconstruction re-
quired a median of 1 minute, or 188,702 nodes visited.

For larger error rates, we can attempt to reconstruct
only the first n/4 bits of the key using this process and
use the lattice techniques to reconstruct the rest of the
key; these computations generally take several hours in
practice. For a 1024-bit RSA key, we would need to
recover 256 bits of a factor. The expected depth of the
tree from our branching reconstruction process would be
$(\frac{1}{2} + \delta)^2 \cdot 256$ (assuming an even distribution of 0s and 1s)
and the expected fraction of branches that would need to
be examined is $1/2 + \delta^2$.


6 Identifying Keys in Memory 


Extracting encryption keys from memory images requires 
a mechanism for locating the target keys. A simple ap- 
proach is to test every sequence of bytes to see whether it 
correctly decrypts some known plaintext. Applying this 
method to a 1 GB memory image known to contain a 128- 
bit symmetric key aligned to some 4-byte machine word 
implies at most $2^{28}$ possible key values. However, this is
only the case if the memory image is perfectly accurate. 
If there are bit errors in the portion of memory containing 
the key, the search quickly becomes intractable. 

We have developed fully automatic techniques for locat- 
ing symmetric encryption keys in memory images, even 
in the presence of bit errors. Our approach is similar to 
the one we used to correct key bit errors in Section 5. We 
target the key schedule instead of the key itself, searching 
for blocks of memory that satisfy (or are close to satisfy- 
ing) the combinatorial properties of a valid key schedule. 
Using these methods we have been able to recover keys 
from closed-source encryption programs without having 
to disassemble them and reconstruct their key data struc- 
tures, and we have even recovered partial key schedules 
that had been overwritten by another program when the 
memory was reallocated. 

Although previous approaches to key recovery do not 
require a scheduled key to be present in memory, they 
have other practical drawbacks that limit their usefulness
for our purposes. Shamir and van Someren [39] pro-
pose visual and statistical tests of randomness which can
quickly identify regions of memory that might contain 
key material, but their methods are prone to false posi- 
tives that complicate testing on decayed memory images. 
Even perfect copies of memory often contain large blocks 
of random-looking data that might pass these tests (e.g., 
compressed files). Pettersson [33] suggests a plausibil- 
ity test for locating a particular program data structure 
that contains key material based on the range of likely 
values for each field. This approach requires the operator 
to manually derive search heuristics for each encryption 
application, and it is not very robust to memory errors. 


6.1 Identifying AES keys 


In order to identify scheduled AES keys in a memory 
image, we propose the following algorithm: 


1. Iterate through each byte of memory. Treat the fol- 
lowing block of 176 or 240 bytes of memory as an 
AES key schedule. 


2. For each word in the potential key schedule, calcu- 
late the Hamming distance from that word to the key 
schedule word that should have been generated from 
the surrounding words. 


3. If the total number of bits violating the constraints 
on a correct AES key schedule is sufficiently small, 
output the key. 
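
The three steps above can be sketched in Python for 128-bit keys. This is an illustrative reimplementation, not the keyfind program: it computes the S-box from its definition in the AES specification, assumes the schedule is stored contiguously in specification byte order, and omits the entropy test. All function names are assumptions.

```python
def gmul(a: int, b: int) -> int:
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def build_sbox() -> list:
    """AES S-box: field inverse followed by the affine transformation."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(y for y in range(1, 256) if gmul(x, y) == 1)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

SBOX = build_sbox()
RCON = [0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36]

def next_word(prev: bytes, back4: bytes, i: int) -> bytes:
    """Word i of an AES-128 schedule, predicted from words i-1 and i-4."""
    t = list(prev)
    if i % 4 == 0:
        t = t[1:] + t[:1]                # RotWord
        t = [SBOX[b] for b in t]         # SubWord
        t[0] ^= RCON[i // 4 - 1]
    return bytes(a ^ b for a, b in zip(back4, t))

def expand_key(key: bytes) -> bytes:
    """Full 176-byte AES-128 key schedule."""
    words = [key[4 * i:4 * i + 4] for i in range(4)]
    for i in range(4, 44):
        words.append(next_word(words[i - 1], words[i - 4], i))
    return b"".join(words)

def schedule_errors(block: bytes) -> int:
    """Step 2: total Hamming distance between each word and its prediction."""
    words = [block[4 * i:4 * i + 4] for i in range(44)]
    errors = 0
    for i in range(4, 44):
        predicted = next_word(words[i - 1], words[i - 4], i)
        errors += sum(bin(a ^ b).count("1") for a, b in zip(predicted, words[i]))
    return errors

def find_aes128_keys(image: bytes, threshold: int = 0) -> list:
    """Steps 1 and 3: scan every offset and report (offset, key) candidates."""
    hits = []
    for off in range(len(image) - 176 + 1):
        block = image[off:off + 176]
        if schedule_errors(block) <= threshold:
            hits.append((off, bytes(block[:16])))
    return hits

key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
image = bytes(50) + expand_key(key) + bytes(30)
print(find_aes128_keys(image))
```

Raising the threshold tolerates decayed schedules at the cost of more false positives; the reported key would then be passed to an error-correction step such as the one in Section 5.2.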


We created an application called keyfind that im- 
plements this algorithm for 128- and 256-bit AES keys. 
The program takes a memory image as input and outputs 
a list of likely keys. It assumes that key schedules are 
contained in contiguous regions of memory and in the 
byte order used in the AES specification [1]; this can be 
adjusted to target particular cipher implementations. A 
threshold parameter controls how many bit errors will be 
tolerated in candidate key schedules. We apply a quick 
test of entropy to reduce false positives. 

We expect that this approach can be applied to many 
other ciphers. For example, to identify DES keys based 
on their key schedule, calculate the distance from each 
potential subkey to the permutation of the key. A similar 
method works to identify the precomputed multiplication 
tables used for advanced cipher modes like LRW (see 
Section 5.3). 


6.2 Identifying RSA keys 


Methods proposed for identifying RSA private keys range 
from the purely algebraic (Shamir and van Someren sug- 
gest, for example, multiplying adjacent key-sized blocks 
of memory [39]) to the ad hoc (searching for the RSA
Object Identifiers found in ASN.1 key objects [34]). The
former ignores the widespread use of standard key for- 
mats, and the latter seems insufficiently robust. 

The most widely used format for an RSA private key 
is specified in PKCS #1 [36] as an ASN.1 object of type 
RSAPrivateKey with the following fields: version, mod-
ulus n, publicExponent e, privateExponent d, prime1
p, prime2 q, exponent1 d mod (p − 1), exponent2 d
mod (q − 1), coefficient $q^{-1} \bmod p$, and optional other
information. This object, packaged in DER encoding, is 
the standard format for storage and interchange of private 
keys. 

This format suggests two techniques we might use 
for identifying RSA keys in memory: we could search 
for known contents of the fields, or we could look for 
memory that matches the structure of the DER encoding. 
We tested both of these approaches on a computer running 
Apache 2.2.3 with mod_ssl.

One value in the key object that an attacker is likely 
to know is the public modulus. (In the case of a web 
server, the attacker can obtain this and the rest of the 
public key by querying the server.) We tried searching for 
the modulus in memory and found several matches, all of 
them instances of the server’s public or private key. 

We also tested a key finding method described by 
Ptacek [34] and others: searching for the RSA Object 
Identifiers that should mark ASN.1 key objects. This 
technique yielded only false positives on our test system. 

Finally, we experimented with a new method, searching 
for identifying features of the DER-encoding itself. We 
looked for the sequence identifier (0x30) followed a few 
bytes later by the DER encoding of the RSA version 
number and then by the beginning of the DER encoding 
of the next field (02 01 00 02). This method found several 
copies of the server’s private key, and no false positives. 
To locate keys in decayed memory images, we can adapt 
this technique to search for sequences of bytes with low 
Hamming distance to these markers and check that the 
subsequent bytes satisfy some heuristic entropy bound. 
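
A minimal Python sketch of this marker search follows. It looks for the sequence identifier 0x30, skips a short gap for the DER length field (the gap size of 1 to 4 bytes is an assumption), and then compares the next four bytes with 02 01 00 02, allowing a configurable number of bit errors in total. The function names are illustrative and the entropy check on the following bytes is omitted.

```python
MARKER = bytes([0x02, 0x01, 0x00, 0x02])

def hamming(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def find_der_rsa_keys(image: bytes, max_gap: int = 4, max_errors: int = 0) -> list:
    """Return offsets that look like the start of a DER-encoded RSAPrivateKey."""
    hits = []
    for off in range(len(image) - 5):
        head_errors = bin(image[off] ^ 0x30).count("1")
        if head_errors > max_errors:
            continue
        for gap in range(1, max_gap + 1):
            start = off + 1 + gap
            chunk = image[start:start + len(MARKER)]
            if len(chunk) < len(MARKER):
                break
            if head_errors + hamming(chunk, MARKER) <= max_errors:
                hits.append(off)
                break
    return hits

header = bytes.fromhex("308204a40201000282010100")
image = bytes(10) + header + bytes(10)
print(find_der_rsa_keys(image))
```

With max_errors above zero the search tolerates decayed markers, but short markers match by chance more often, so a follow-up check on the candidate's contents becomes more important.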


7 Attacking Encrypted Disks 


Encrypting hard drives is an increasingly common coun- 
termeasure against data theft, and many users assume that 
disk encryption products will protect sensitive data even 
if an attacker has physical access to the machine. A Cal- 
ifornia law adopted in 2002 [10] requires disclosure of 
possible compromises of personal information, but offers 
a safe harbor whenever data was “encrypted.” Though 
the law does not include any specific technical standards, 
many observers have recommended the use of full-disk 
or file system encryption to obtain the benefit of this safe 
harbor. (At least 38 other states have enacted data breach 
notification legislation [32].) Our results below suggest
that disk encryption, while valuable, is not necessarily a
sufficient defense. We find that a moderately skilled at- 
tacker can circumvent many widely used disk encryption 
products if a laptop is stolen while it is powered on or 
suspended. 

We have applied some of the tools developed in this 
paper to attack popular on-the-fly disk encryption sys- 
tems. The most time-consuming parts of these tests were 
generally developing system-specific attacks and setting 
up the encrypted disks. Actually imaging memory and 
locating keys took only a few minutes and were almost 
fully automated by our tools. We expect that most disk 
encryption systems are vulnerable to such attacks. 


BitLocker BitLocker, which is included with some ver- 
sions of Windows Vista, operates as a filter driver that 
resides between the file system and the disk driver, en- 
crypting and decrypting individual sectors on demand. 
The keys used to encrypt the disk reside in RAM, in 
scheduled form, for as long as the disk is mounted. 

In a paper released by Microsoft, Ferguson [21] de- 
scribes BitLocker in enough detail both to discover the 
roles of the various keys and to program an independent 
implementation of the BitLocker encryption algorithm 
without reverse engineering any software. BitLocker uses 
the same pair of AES keys to encrypt every sector on the 
disk: a sector pad key and a CBC encryption key. These 
keys are, in turn, indirectly encrypted by the disk’s master 
key. To encrypt a sector, the plaintext is first XORed 
with a pad generated by encrypting the byte offset of the 
sector under the sector pad key. Next, the data is fed 
through two diffuser functions, which use a Microsoft- 
developed algorithm called Elephant. The purpose of 
these un-keyed functions is solely to increase the proba- 
bility that modifications to any bits of the ciphertext will 
cause unpredictable modifications to the entire plaintext 
sector. Finally, the data is encrypted using AES in CBC 
mode using the CBC encryption key. The initialization 
vector is computed by encrypting the byte offset of the 
sector under the CBC encryption key. 

We have created a fully-automated demonstration at- 
tack called BitUnlocker. It consists of an external USB 
hard disk containing Linux, a custom SYSLINUX-based 
bootloader, and a FUSD [20] filter driver that allows Bit- 
Locker volumes to be mounted under Linux. To use 
BitUnlocker, one first cuts the power to a running Win- 
dows Vista system, connects the USB disk, and then re- 
boots the system off of the external drive. BitUnlocker 
then automatically dumps the memory image to the ex- 
ternal disk, runs keyfind on the image to determine 
candidate keys, tries all combinations of the candidates 
(for the sector pad key and the CBC encryption key), and, 
if the correct keys are found, mounts the BitLocker en- 
crypted volume. Once the encrypted volume has been 
mounted, one can browse it like any other volume in 
Linux. On a modern laptop with 2 GB of RAM, we found
that this entire process took approximately 25 minutes. 

BitLocker differs from other disk encryption products 
in the way that it protects the keys when the disk is not 
mounted. In its default “basic mode,” BitLocker protects 
the disk’s master key solely with the Trusted Platform 
Module (TPM) found on many modern PCs. This config- 
uration, which may be quite widely used [21], is particu- 
larly vulnerable to our attack, because the disk encryption 
keys can be extracted with our attacks even if the com- 
puter is powered off for a long time. When the machine 
boots, the keys will be loaded into RAM automatically 
(before the login screen) without the entry of any secrets. 

It appears that Microsoft is aware of this problem [31] 
and recommends configuring BitLocker in “advanced 
mode,” where it protects the disk key using the TPM 
along with a password or a key on a removable USB 
device. However, even with these measures, BitLocker 
is vulnerable if an attacker gets to the system while the 
screen is locked or the computer is asleep (though not if 
it is hibernating or powered off).


FileVault Apple’s FileVault disk encryption software 
has been examined and reverse-engineered in some detail [44].
In Mac OS X 10.4, FileVault uses 128-bit AES in
CBC mode. A user-supplied password decrypts a header
that contains both the AES key and a second key $k_2$ used
to compute IVs. The IV for a disk block with logical
index $I$ is computed as $\text{HMAC-SHA1}_{k_2}(I)$.
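
The IV computation described above (HMAC-SHA1 keyed with the second header key, applied to the logical block index) might look like the following sketch. The encoding of the index as a 4-byte big-endian integer and the truncation of the 20-byte MAC to one 16-byte AES block are assumptions made for illustration; the function name is hypothetical.

```python
import hashlib
import hmac
import struct

AES_BLOCK = 16  # AES block size in bytes


def filevault_iv(iv_key: bytes, block_index: int) -> bytes:
    """Derive a per-block CBC IV from the IV key and the logical block index.

    Assumptions: the index is serialized as 4 bytes, big-endian, and the
    160-bit HMAC-SHA1 output is truncated to the first 128 bits.
    """
    mac = hmac.new(iv_key, struct.pack(">I", block_index), hashlib.sha1).digest()
    return mac[:AES_BLOCK]
```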

We used our EFI memory imaging program to ex- 
tract a memory image from an Intel-based Macintosh 
system with a FileVault volume mounted. Our keyfind 
program automatically identified the FileVault AES key,
which did not contain any bit errors in our tests. 

With the recovered AES key but not the IV key, we 
can decrypt 4080 bytes of each 4096 byte disk block (all 
except the first AES block). The IV key is present in mem- 
ory. Assuming no bits in the IV key decay, an attacker can 
identify it by testing all 160-bit substrings of memory to 
see whether they create a plausible plaintext when XORed 
with the decryption of the first part of the disk block. The 
AES and IV keys together allow full decryption of the 
volume using programs like vilefault [45]. 
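
The search for the IV key can be sketched as below. The input `raw_first_block` stands for the AES decryption of the first ciphertext block of a disk block, before the CBC XOR with the IV. The index encoding, the truncation of the MAC, and the printable-ASCII plausibility test are all assumptions for illustration; a real attack would test against known file system structures instead.

```python
import hashlib
import hmac
import struct
from typing import Callable, Iterator

AES_BLOCK = 16   # AES block size in bytes
IV_KEY_LEN = 20  # 160-bit IV key


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def derive_iv(iv_key: bytes, block_index: int) -> bytes:
    # Assumed encoding: 4-byte big-endian index, MAC truncated to one AES block.
    mac = hmac.new(iv_key, struct.pack(">I", block_index), hashlib.sha1).digest()
    return mac[:AES_BLOCK]


def looks_like_text(block: bytes) -> bool:
    # Stand-in plausibility test: every byte is printable ASCII.
    return all(0x20 <= b < 0x7F for b in block)


def find_iv_key_offsets(
    image: bytes,
    raw_first_block: bytes,
    block_index: int,
    plausible: Callable[[bytes], bool] = looks_like_text,
) -> Iterator[int]:
    """Yield offsets of 160-bit windows that produce a plausible first plaintext block."""
    for off in range(len(image) - IV_KEY_LEN + 1):
        candidate = image[off:off + IV_KEY_LEN]
        plaintext = xor(raw_first_block, derive_iv(candidate, block_index))
        if plausible(plaintext):
            yield off
```

Testing several disk blocks against each surviving candidate would quickly eliminate chance matches.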

In the process of testing FileVault, we discovered that 
Mac OS X 10.4 and 10.5 keep multiple copies of the 
user’s login password in memory, where they are vul- 
nerable to imaging attacks. Login passwords are often 
used to protect the default keychain, which may protect 
passphrases for FileVault disk images.


TrueCrypt TrueCrypt is a popular open-source disk 
encryption product for the Windows, Mac OS, and Linux 
platforms. It supports a variety of ciphers, including AES, 
Serpent, and Twofish. In version 4, all ciphers used LRW 
mode; in version 5, they use XTS mode (see Section 5.3). 
TrueCrypt stores a cipher key and a tweak key in the
volume header for each disk, which is then encrypted 
with a separate key derived from a user-entered password. 

We tested TrueCrypt versions 4.3a and 5.0a running on 
a Linux system. We mounted a volume encrypted with 
a 256-bit AES key, then briefly cut power to the system 
and used our memory imaging tools to record an image of 
the retained memory data. In both cases, our keyfind 
program was able to identify the 256-bit AES encryption 
key, which did not contain any bit errors. For TrueCrypt 
5.0a, keyfind was also able to recover the 256-bit AES
XTS tweak key without errors. 

To decrypt TrueCrypt 4 disks, we also need the LRW 
tweak key. We observed that TrueCrypt 4 stores the LRW 
key in the four words immediately preceding the AES key 
schedule. In our test memory image, the LRW key did 
not contain any bit errors. (Had errors occurred, we could 
have recovered the correct key by applying the techniques 
we developed in Section 5.3.) 


dm-crypt Linux kernels starting with 2.6 include built- 
in support for dm-crypt, an on-the-fly disk encryption 
subsystem. The dm-crypt subsystem handles a variety of 
ciphers and modes, but defaults to 128-bit AES in CBC 
mode with non-keyed IVs. 

We tested a dm-crypt volume created and mounted 
using the LUKS (Linux Unified Key Setup) branch of 
the cryptsetup utility and kernel version 2.6.20. The
volume used the default AES-CBC format. We briefly 
powered down the system and captured a memory image 
with our PXE kernel. Our keyfind program identified
the correct 128-bit AES key, which did not contain any 
bit errors. After recovering this key, an attacker could 
decrypt and mount the dm-crypt volume by modifying 
the cryptsetup program to allow input of the raw key.


Loop-AES Loop-AES is an on-the-fly disk encryption 
package for Linux systems. In its recommended con- 
figuration, it uses a so-called “multi-key-v3” encryption 
mode, in which each disk block is encrypted with one 
of 64 encryption keys. By default, it encrypts sectors 
with AES in CBC mode, using an additional AES key to 
generate IVs. 

We configured an encrypted disk with Loop-AES ver- 
sion 3.2b using 128-bit AES encryption in “multi-key-v3” 
mode. After imaging the contents of RAM, we applied 
our keyfind program, which revealed the 65 AES keys. 
An attacker could identify which of these keys correspond 
to which encrypted disk blocks by performing a series 
of trial decryptions. Then, the attacker could modify the 
Linux losetup utility to mount the encrypted disk with 
the recovered keys. 

Loop-AES attempts to guard against the long-term 
memory burn-in effects described by Gutmann [23] and 
others. For each of the 65 AES keys, it maintains two 
copies of the key schedule in memory, one normal copy
and one with each bit inverted. It periodically swaps these 
copies, ensuring that every memory cell stores a 0 bit for 
as much time as it stores a 1 bit. Not only does this fail to
prevent the memory remanence attacks that we describe 
here, but it also makes it easier to identify which keys be- 
long to Loop-AES and to recover the keys in the presence 
of memory errors. After recovering the regular AES key 
schedules using a program like keyfind, the attacker 
can search the memory image for the inverted key sched- 
ules. Since very few programs maintain both regular and 
inverted key schedules in this way, those keys are highly 
likely to belong to Loop-AES. Having two related copies 
of each key schedule provides additional redundancy that 
can be used to identify which bit positions are likely to 
contain errors. 
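
To illustrate how the inverted copies help an attacker, the sketch below locates the bitwise complement of a recovered key schedule and then flags bit positions where the two copies are no longer complementary. The function names are hypothetical, and the exact-match search assumes an undecayed inverted copy; a decayed image would need a Hamming-distance-tolerant search instead.

```python
def invert(data: bytes) -> bytes:
    """Bitwise complement of every byte."""
    return bytes(b ^ 0xFF for b in data)


def find_inverted_copy(image: bytes, schedule: bytes) -> int:
    """Offset of the inverted key schedule in the image, or -1 if absent."""
    return image.find(invert(schedule))


def suspect_bit_positions(normal: bytes, inverted: bytes) -> list[int]:
    """Bit positions where the two copies hold the same value.

    The copies should be exact complements, so an equal bit means that at
    least one of the two cells has decayed.
    """
    suspects = []
    for i, (a, b) in enumerate(zip(normal, inverted)):
        equal_bits = ~(a ^ b) & 0xFF
        for bit in range(8):
            if (equal_bits >> bit) & 1:
                suspects.append(i * 8 + bit)
    return suspects
```

Note that a flagged position shows where an error occurred but not which copy holds the original value; that still narrows the reconstruction search considerably.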


8 Countermeasures and their Limitations 


Memory imaging attacks are difficult to defend against 
because cryptographic keys that are in active use need to 
be stored somewhere. Our suggested countermeasures fo- 
cus on discarding or obscuring encryption keys before an 
adversary might gain physical access, preventing memory- 
dumping software from being executed on the machine, 
physically protecting DRAM chips, and possibly making 
the contents of memory decay more readily. 


Scrubbing memory Countermeasures begin with ef- 
forts to avoid storing keys in memory. Software should 
overwrite keys when they are no longer needed, and 
it should attempt to prevent keys from being paged to 
disk. Runtime libraries and operating systems should 
clear memory proactively; Chow et al. show that this 
precaution need not be expensive [13]. Of course, these 
precautions cannot protect keys that must be kept in mem- 
ory because they are still in use, such as the keys used by 
encrypted disks or secure web servers. 

Systems can also clear memory at boot time. Some 
PCs can be configured to clear RAM at startup via a de- 
structive Power-On Self-Test (POST) before they attempt 
to load an operating system. If the attacker cannot by- 
pass the POST, he cannot image the PC’s memory with 
locally-executing software, though he could still physically
move the memory chips to a different computer with
a more permissive BIOS. 


Limiting booting from network or removable media 
Many of our attacks involve booting a system via the 
network or from removable media. Computers can be 
configured to require an administrative password to boot 
from these sources. We note, however, that even if a 
system will boot only from the primary hard drive, an 
attacker could still swap out this drive, or, in many cases, 
reset the computer’s NVRAM to re-enable booting from
removable media. 


Suspending a system safely Our results show that sim- 
ply locking the screen of a computer (i.e., keeping the sys- 
tem running but requiring entry of a password before the 
system will interact with the user) does not protect the con- 
tents of memory. Suspending a laptop’s state (“sleeping”) 
is also ineffective, even if the machine enters screen-lock 
on awakening, since an adversary could simply awaken 
the laptop, power-cycle it, and then extract its memory 
state. Suspending-to-disk (“hibernating”) may also be 
ineffective unless an externally-held secret is required to 
resume normal operations. 

With most disk encryption systems, users can protect 
themselves by powering off the machine completely when 
it is not in use. (BitLocker in “basic” TPM mode remains 
vulnerable, since the system will automatically mount the 
disk when the machine is powered on.) Memory con- 
tents may be retained for a short period, so the owner 
should guard the machine for a minute or so after re- 
moving power. Though effective, this countermeasure is 
inconvenient, since the user will have to wait through the 
lengthy boot process before accessing the machine again. 

Suspending can be made safe by requiring a password 
or other external secret to reawaken the machine, and 
encrypting the contents of memory under a key derived 
from the password. The password must be strong (or 
strengthened), as an attacker who can extract memory 
contents can then try an offline password-guessing attack. 
If encrypting all of memory is too expensive, the system 
could encrypt only those pages or regions containing im- 
portant keys. Some existing systems can be configured to 
suspend safely in this sense, although this is often not the 
default behavior [5]. 
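
As one possible way to "strengthen" the suspend password, the sketch below derives a memory-encryption key with an iterated key derivation function. The choice of PBKDF2-HMAC-SHA256, the iteration count, and the function name are assumptions for illustration only; the encryption of memory pages themselves is not shown.

```python
import hashlib


def derive_memory_key(password: str, salt: bytes, iterations: int = 600_000) -> bytes:
    """Derive a 256-bit key from a password using PBKDF2-HMAC-SHA256.

    A high iteration count raises the cost of each offline guess for an
    attacker who has captured the encrypted memory contents. The salt
    should be random and stored alongside the encrypted image.
    """
    return hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations, dklen=32
    )
```

The derived key must itself be discarded from RAM once memory has been encrypted; otherwise it is exposed to the same imaging attacks.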


Avoiding precomputation Our attacks show that using 
precomputation to speed cryptographic operations can 
make keys more vulnerable. Precomputation tends to lead 
to redundant storage of key information, which can help 
an attacker reconstruct keys in the presence of bit errors, 
as described in Section 5. 

Avoiding precomputation may hurt performance, as po- 
tentially expensive computations will be repeated. (Disk 
encryption systems are often implemented on top of OS- 
and drive-level caches, so they are more performance- 
sensitive than might be assumed.) Compromises are pos- 
sible; for example, precomputed values could be cached 
for a predetermined period of time and discarded if not 
re-used within that interval. This approach accepts some 
vulnerability in exchange for reducing computation, a 
sensible tradeoff in some situations. 
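
The caching compromise could be sketched as follows. The class name, the `expand` callback (standing in for a key-schedule computation), and the injectable clock are hypothetical. Note also that Python cannot guarantee that no stray copies of the data remain in memory; a production implementation would need a language with explicit memory control.

```python
import time
from typing import Callable, Optional


class ExpiringKeySchedule:
    """Cache a precomputed key schedule, discarding it after an idle period."""

    def __init__(self, key: bytes, expand: Callable[[bytes], bytes],
                 ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._key = key
        self._expand = expand
        self._ttl = ttl_seconds
        self._clock = clock
        self._schedule: Optional[bytearray] = None
        self._last_used = 0.0

    def purge_if_idle(self) -> None:
        """Overwrite and drop the cached schedule if it has not been used recently."""
        if self._schedule is not None and self._clock() - self._last_used > self._ttl:
            for i in range(len(self._schedule)):
                self._schedule[i] = 0
            self._schedule = None

    def get(self) -> bytearray:
        """Return the schedule, recomputing it if it was discarded."""
        self.purge_if_idle()
        if self._schedule is None:
            self._schedule = bytearray(self._expand(self._key))
        self._last_used = self._clock()
        return self._schedule
```

In this sketch the raw key stays resident; only the redundant precomputed material, which aids reconstruction in the presence of bit errors, is discarded.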


Key expansion Another defense against key reconstruc- 
tion is to apply some transform to the key as it is stored in 
memory in order to make it more difficult to reconstruct in 
the case of errors. This problem has been considered from
a theoretical perspective; Canetti et al. [11] define the no- 
tion of an exposure-resilient function whose input remains 
secret even if all but some small fraction of the output is 
revealed, and show that the existence of this primitive is 
equivalent to the existence of one-way functions. 

In practice, suppose we have a key K which is not 
currently in use but will be needed later. We cannot 
overwrite the key but we want to make it more resistant 
to reconstruction. One way to do this is to allocate a large 
B-bit buffer, fill the buffer with random data R, then store 
K ⊕ H(R), where H is a hash function such as SHA-256.

Now suppose there is a power-cutting attack which 
causes d of the bits in this buffer to be flipped. If the hash 
function is strong, the adversary must search a space of 
size $\binom{B/2+d}{d}$ to discover which bits were flipped of the
roughly B/2 that could have decayed. If B is large, this 
search will be prohibitive even when d is relatively small. 
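
The scheme can be sketched as below: the key K is masked with H(R) for a random B-bit buffer R, and the attacker's search space is estimated with a binomial coefficient. The default buffer size, the function names, and the reading of the bound as choosing d positions among roughly B/2 + d ground-state bits are assumptions made for illustration.

```python
import hashlib
import math
import os


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def protect(key: bytes, buffer_bits: int = 8 * 4096) -> tuple[bytes, bytes]:
    """Return (R, K xor H(R)) for a fresh random buffer R of buffer_bits bits."""
    if len(key) > hashlib.sha256().digest_size:
        raise ValueError("key longer than one SHA-256 digest")
    r = os.urandom(buffer_bits // 8)
    return r, xor(key, hashlib.sha256(r).digest())


def recover(r: bytes, masked: bytes) -> bytes:
    """Recompute K from an intact buffer R and the masked key."""
    return xor(masked, hashlib.sha256(r).digest())


def search_space(buffer_bits: int, flipped: int) -> int:
    """Candidate buffers an attacker must try if `flipped` bits have decayed."""
    return math.comb(buffer_bits // 2 + flipped, flipped)
```

Under these assumptions, a 4 KB buffer with only ten decayed bits already yields a search space of more than 2^100 candidates, since every candidate R must be hashed to test it.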

In principle, all keys could be stored in this way, re- 
computed when in use, and deleted immediately after. 
Alternatively, we could sometimes keep keys in memory, 
introducing the precomputation tradeoff discussed above. 

For greater protection, the operating system could per- 
form tests to identify memory locations that are especially 
quick to decay, and use these to store key material. 


Physical defenses Some of our attacks rely on physical 
access to DRAM chips or modules. These attacks can 
be mitigated by physically protecting the memory. For 
example, DRAM modules could be locked in place inside 
the machine, or encased in a material such as epoxy to 
frustrate attempts to remove or access them. Similarly, the 
system could respond to low temperatures or opening of 
the computer’s case by attempting to overwrite memory, 
although these defenses require active sensor systems with 
their own backup power supply. Many of these techniques 
are associated with specialized tamper-resistant hardware 
such as the IBM 4758 coprocessor [18, 41] and could 
add considerable cost to a PC. However, a small amount 
of memory soldered to a motherboard could be added at 
relatively low cost. 


Architectural changes Some countermeasures try to 
change the machine’s architecture. This will not help 
on existing machines, but it might make future machines 
more secure. 

One approach is to find or design DRAM systems that 
lose their state quickly. This might be difficult, given 
the tension between the desire to make memory decay 
quickly and the desire to keep the probability of decay 
within a DRAM refresh interval vanishingly small. 

Another approach is to add key-store hardware that 
erases its state on power-up, reset, and shutdown. This 
would provide a safe place to put a few keys, though 
precomputation of derived keys would still pose a risk. 
Others have proposed architectures that would routinely
encrypt the contents of memory for security purposes [28, 
27, 17]. These would apparently prevent the attacks we 
describe, as long as the encryption keys were destroyed 
on reset or power loss. 


Encrypting in the disk controller Another approach 
is to encrypt data in the hard disk controller hardware, as 
in Full Disk Encryption (FDE) systems such as Seagate’s 
“DriveTrust” technology [38]. 


In its basic form, this approach uses a write-only key 
register in the disk controller, into which the software 
can write a symmetric encryption key. Data blocks are 
encrypted, using the key from the key register, before 
writing to the disk. Similarly, blocks are decrypted after 
reading. This allows encrypted storage of all blocks on a 
disk, without any software modifications beyond what is 
required to initialize the key register. 


This approach differs from typical disk encryption sys- 
tems in that encryption and decryption are done by the 
disk controller rather than by software in the main CPU, 
and that the main encryption keys are stored in the disk 
controller rather than in DRAM. 


To be secure, such a system must ensure that the key 
register is erased whenever a new operating system is 
booted on the computer; otherwise, an attacker can reboot 
into a malicious kernel that simply reads the disk contents. 
For similar reasons, the system must also ensure that the 
key register is erased whenever an attacker attempts to 
move the disk controller to another computer (even if the 
attacker maintains power while doing so). 


Some systems build more sophisticated APIs, implemented
by software on the disk controller, on top of
this basic facility. Such APIs, and their implementation, 
would require further security analyses. 


We have not evaluated any specific systems of this type. 
We leave such analyses for future work. 


Trusted computing Trusted Computing hardware, in 
the form of Trusted Platform Modules (TPMs) [42], is now
deployed in some personal computers. Though useful 
against some attacks, today’s Trusted Computing hard- 
ware does not seem to prevent the attacks described here. 


Deployed TCG TPMs do not implement bulk encryp- 
tion. Instead, they monitor boot history in order to decide 
(or help other machines decide) whether it is safe to store 
a key in RAM. If a software module wants to use a key, 
it can arrange that the usable form of that key will not 
be stored in RAM unless the boot process has gone as 
expected [31]. However, once the key is stored in RAM, 
it is subject to our attacks. TPMs can prevent a key from 
being loaded into memory for use, but they cannot prevent 
it from being captured once it is in memory. 


9 Conclusions


Contrary to popular belief, DRAMs hold their values 
for surprisingly long intervals without power or refresh. 
Our experiments show that this fact enables a variety of 
security attacks that can extract sensitive information such 
as cryptographic keys from memory, despite the operating 
system’s efforts to protect memory contents. The attacks 
we describe are practical—for example, we have used 
them to defeat several popular disk encryption systems. 

Other types of software may be similarly vulnerable. 
DRM systems often rely on symmetric keys stored in 
memory, which may be recoverable using the techniques 
outlined in our paper. As we have shown, SSL-enabled 
web servers are vulnerable, since they often keep in mem- 
ory private keys needed to establish SSL sessions. Fur- 
thermore, methods similar to our key-finder would likely 
be effective for locating passwords, account numbers, or 
other sensitive data in memory. 

There seems to be no easy remedy for these vulnera- 
bilities. Simple software changes have benefits and draw- 
backs; hardware changes are possible but will require 
time and expense; and today’s Trusted Computing tech- 
nologies cannot protect keys that are already in memory. 
The risk seems highest for laptops, which are often taken 
out in public in states that are vulnerable to our attacks. 
These risks imply that disk encryption on laptops, while 
beneficial, does not guarantee protection. 

Ultimately, it might become necessary to treat DRAM 
as untrusted, and to avoid storing sensitive data there, but 
this will not be feasible until architectures are changed to 
give software a safe place to keep its keys. 


Abstract


The inability of humans to generate and remember strong 
secrets makes it difficult for people to manage crypto- 
graphic keys. To address this problem, numerous pro- 
posals have been suggested to enable a human to repeat- 
ably generate a cryptographic key from her biometrics, 
where the strength of the key rests on the assumption 
that the measured biometrics have high entropy across 
the population. In this paper we show that, despite the 
fact that several researchers have examined the security 
of BKGs, the common techniques used to argue the se- 
curity of practical systems are lacking. To address this 
issue we reexamine two well known, yet sometimes mis- 
understood, security requirements. We also present an- 
other that we believe has not received adequate attention 
in the literature, but is essential for practical biometric 
key generators. To demonstrate that each requirement 
has significant importance, we analyze three published 
schemes, and point out deficiencies in each. For exam- 
ple, in one case we show that failing to meet a require- 
ment results in a construction where an attacker has a 
22% chance of finding ostensibly 43-bit keys on her first 
guess. In another we show how an attacker who com- 
promises a user’s cryptographic key can then infer that 
user’s biometric, thus revealing any other key generated 
using that biometric. We hope that by examining the pit- 
falls that occur continuously in the literature, we enable 
researchers and practitioners to more accurately analyze 
proposed constructions. 


1 Introduction 


While cryptographic applications vary widely in terms of 
assumptions, constructions, and goals, all require crypto- 
graphic keys. In cases where a computer should not be 
trusted to protect cryptographic keys—as in laptop file 
encryption, where keeping the key on the laptop obvi- 
ates the utility of the file encryption—the key must be 
input by its human operator. It is well known, however,
that humans have difficulty choosing and remembering 
strong secrets (e.g., [2, 14]). As a result, researchers have 
devoted significant effort to finding input that has suffi- 
cient unpredictability to be used in cryptographic appli- 
cations, but that remains easy for humans to regenerate 
reliably. One of the more promising suggestions in this 
direction is biometrics: characteristics of human physiology
or behavior. Biometrics are attractive as a means
for key generation as they are easily reproducible by the 
legitimate user, yet potentially difficult for an adversary 
to guess. 


There have been numerous proposals for generating 
cryptographic keys from biometrics (e.g., [33, 34, 28]). 
At a high level, these Biometric Cryptographic Key Gen- 
erators, or BKGs, follow a similar design: during an en- 
rollment phase, biometric samples from a user are col- 
lected; statistical functions, or features, are applied to the 
samples; and some representation of the output of these 
features is stored in a data structure called a biometric 
template. Later, the same user can present another sam- 
ple, which is processed with the stored template to repro- 
duce a key. A different user, however, should be unable 
to produce that key. Since the template itself is generally 
stored where the key is used (e.g., in a laptop file encryp- 
tion application, on the laptop), a template must not leak 
any information about the key that it is used to recon- 
struct. That is, the threat model admits the capture of the 
template by the adversary; otherwise the template could 
be the cryptographic key itself, and biometrics would not 
be needed to reconstruct the key at all. 

Generally, one measures the strength of a crypto- 
graphic key by its entropy, which quantifies the amount 
of uncertainty in the key from an adversary’s point of 
view. If one regards a key generator as drawing an ele- 
ment uniformly at random from a large set, then the en- 
tropy of the keys can be easily computed as the base-two 
logarithm of the size of the set. Computing the entropy of 
keys output by a concrete instantiation of a key generator,
however, is non-trivial because “choosing uniformly
at random” is difficult to achieve in practice. This is in 
part due to the fact that the key generator’s source of ran- 
domness may be based on information that is leaked by 
external sources. For instance, an oft-cited flaw in Ker- 
beros version 4 allowed adversaries to guess ostensibly 
56-bit DES session keys in only $2^{20}$ guesses [13]. The
problem stemmed from the fact that the seeding infor- 
mation input to the key generator was related to infor- 
mation that could be easily inferred by an adversary. In 
other words, this auxiliary information greatly reduced 
the entropy of the key space. 
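
As a small worked example of the uniform case, the entropy of a key drawn uniformly from a set is the base-two logarithm of the set's size. The sketch below compares a nominal 56-bit key space with an effective space of about 2^20 keys, the figure we take for the Kerberos example; the function name is hypothetical.

```python
import math


def uniform_entropy_bits(keyspace_size: int) -> float:
    """Entropy, in bits, of a key drawn uniformly from keyspace_size values."""
    return math.log2(keyspace_size)


nominal = uniform_entropy_bits(2 ** 56)    # what the key length suggests
effective = uniform_entropy_bits(2 ** 20)  # what the adversary actually faces
work_reduction = 2 ** int(nominal - effective)  # factor by which guessing gets cheaper
```

Here the auxiliary information shrinks the adversary's work by a factor of 2^36, even though the key length is unchanged.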

In the case of biometric key generators, where the ran- 
domness used to generate the keys comes from a user’s 
biometric and is a function of the particular features used 
by the system, the aforementioned problems are com- 
pounded by several factors. For instance, in the case of 
certain biometric modalities, it is known that population 
statistics can be strong indicators of a specific user’s biometric
[10, 36, 3]. In other words, depending on the type
of biometric and the set of features used by the BKG, 
access to population statistics can greatly reduce the en- 
tropy of a user’s biometric, and consequently, reduce the 
entropy of her key. Moreover, templates could also leak 
information about the key. To complicate matters, in the 
context of biometric key generation, in addition to eval- 
uating the strength of the key, one must also consider 
the privacy implications associated with using biomet- 
rics. Indeed, the protection of a user’s biometric infor- 
mation is crucial, not only to preserve privacy, but also 
to enable that user to reuse the biometric key generator 
to manage a new key. We argue that this concern for pri- 
vacy mandates not only that the template protect the bio- 
metric, but also that the keys output by a BKG not leak 
information about the biometric. Otherwise, the compro- 
mise of a key might render the user’s biometric unusable 
for key generation thereafter. 

The goal of this work is to distill the seemingly inter- 
twined and complex security requirements of biometric 
key generators into a small set of requirements that fa- 
cilitate practical security analyses by designers. Specifi- 
cally, the contributions of this paper are: 


I. The specification of three practical requirements 
that allow designers to verify that a BKG ensures
the privacy of a user’s biometric and generates keys 
that are suitable for cryptographic applications. 


II. The analyses of three published BKGs. These are 
contributions in their own right, but more impor- 
tantly serve as concrete evidence of the importance 
of the requirements. 


III. The description of Guessing Distance, a new heuristic
measure that, given empirical data, can quickly
estimate the security afforded by a BKG. 


IV. Discussion of common pitfalls and subtleties in current
standards for empirical evaluation.


Throughout this paper we focus on the importance of 
considering adversaries who have access to public in- 
formation, such as templates, when performing security 
evaluations. We hope that our observations will pro- 
mote critical analyses of BKGs and temper the spread 
of flawed (or incorrectly evaluated) proposals. 


2 Related Work 


To our knowledge, Soutar and Tomko [34] were the first 
to propose biometric key generation. Davida et al. [9] 
proposed an approach that uses iris codes, which are be- 
lieved to have the highest entropy of all commonly-used 
biometrics. However, iris code collection can be consid- 
ered somewhat invasive and the use of majority-decoding 
for error correction—a central ingredient of the Davida 
et al. approach—has been argued to have limited use in 
practice [16]. 

Monrose et al. proposed the first practical BKG that 
exploits behavioral (versus physiological) biometrics for 
key generation [29]. Their technique uses keystroke la- 
tencies to increase the entropy of standard passwords. 
Their construction yields a key at least as secure as the 
password alone, and an empirical analysis showed that 
their approach increases the workload of an attacker by a 
multiplicative factor of up to $2^{15}$. A similar approach was
used to generate cryptographic keys from voice [28, 27]. 
Many constructions followed those of Monrose et al., us- 
ing biometrics such as face [15], fingerprints [33, 39], 
handwriting [40, 17] and iris codes [16, 45]. Unfortu- 
nately, many are susceptible to attacks. Hill-climbing at- 
tacks have been leveraged against fingerprint, face, and 
handwriting-based biometric systems [1, 37, 43] by ex- 
ploiting information leaked during the reconstruction of 
the key from the biometric template. 

There has also been an emergence of generative at- 
tacks against biometrics [5, 23], which use auxiliary in- 
formation such as population statistics along with limited 
information about a target user’s biometric. The attacks 
we present in this paper are different from generative at- 
tacks because we assume that adversaries only have ac- 
cess to templates and auxiliary information. Our attacks, 
therefore, capture much more limited, and arguably more 
realistic, adversaries. Despite such limited information, 
we show how an attacker can recover a target user’s key 
with high likelihood. 

There has also been recent theoretical work to for- 
malize particular aspects of biometric key generators. 
The idea of fuzzy cryptography was first introduced by 
Juels and Wattenberg [21], who describe a commitment 
scheme that supports noise-tolerant decommitments. In 
Section 7 we provide a concrete analysis of a published
construction that highlights the pitfalls of using
fuzzy commitments as biometric key generators. Fur- 
ther work included a fuzzy vault [20], which was later 
analyzed as an instance of a secure-sketch that can be 
used to build fuzzy extractors [11, 6, 12, 22]. Fuzzy 
extractors treat biometric features as non-uniformly dis- 
tributed, error-prone sources and apply error-correction 
algorithms and randomness extractors [18, 30] to gener- 
ate random strings. 

Fuzzy cryptography has made important contributions 
by specifying formal security definitions with which 
BKGs can be analyzed. Nevertheless, there remains a 
gap between theoretical soundness and practical systems. 
For instance, while fuzzy extractors can be effectively 
used as a component in a larger biometric key generation 
system, they do not capture all the practical requirements 
of a BKG. In particular, it is unclear whether known con- 
structions can correct the kinds of errors typically gen- 
erated by humans, especially in the case of behavioral 
biometrics. Moreover, fuzzy extractors require biometric 
inputs with high min-entropy but do not address how to 
select features that achieve this requisite level of entropy. 
Since this is an inherently empirical question, much of 
our work is concerned with how to experimentally eval- 
uate the entropy available in a biometric. 

Lastly, Jain et al. enumerate possible attacks against 
biometric templates and discuss several practical ap- 
proaches that increase template security [19]. Similarly, 
Mansfield and Wayman discuss a set of best practices 
that may be used to measure the security and usability of 
biometric systems [24]. While these works describe spe- 
cific attacks and defenses against systems, they do not 
address biometric key generators and the unique require- 
ments they demand. 


3 Biometric Key Generators 


Before we can argue about how to accurately assess bio- 
metric key generators (BKGs), we first define the algo- 
rithms and components associated with a BKG. These 
definitions are general enough to encompass most pro- 
posed BKGs. 

BKGs are generally composed of two algorithms, an 
enrollment algorithm (Enroll) and a key-generation algo- 
rithm (KeyGen): 


• Enroll($B_1, \ldots, B_\ell$): The enroll algorithm is a
probabilistic algorithm that accepts as input a number
of biometric samples ($B_1, \ldots, B_\ell$), and outputs a
template ($T$) and a cryptographic key ($K$). In the
event that $B_1, \ldots, B_\ell$ do not meet some
predetermined criteria, the enroll algorithm might output the
failure symbol $\bot$.

• KeyGen($B$, $T$): The key generation algorithm accepts
as input one biometric sample ($B$), and a template
($T$). The algorithm outputs either a cryptographic
key ($K$), or the failure symbol $\bot$ if $B$ cannot
be used to create a key.


The enrollment algorithm estimates the variation in- 
herent to a particular user’s biometric reading and com- 
putes information needed to error-correct a new sample 
that is sufficiently close to the enrollment samples. Enroll 
encodes this information into a template and outputs the 
template and the associated key. The key-generation al- 
gorithm uses the template output by the enrollment al- 
gorithm and a new biometric sample to output a key. If 
the provided sample is sufficiently similar to those pro- 
vided during enrollment, then KeyGen and Enroll output 
the same keys. 

Generally speaking, there are four classes of informa- 
tion associated with a BKG. 


• The Biometric ($B$): A biometric is a measurement
of a person’s behavior or physiology. A BKG extracts
$B$ as algorithmically interpretable representations
(e.g., a set of signals). The BKG typically applies
statistical functions, or features ($\phi_1, \ldots, \phi_n$),
to the representations, and uses the output to either
derive [17, 41] or lock [33, 16, 38] a cryptographic
key.


• A Template ($T$): A template is any piece of information
that is stored on the system for the purpose
of re-generating the cryptographic key. Templates
are generally created during an enrollment
process and stored so that a user can easily recre- 
ate her key. For all practical purposes, templates 
must be considered publicly available. Note that 
this assumption implies that more standard biomet- 
ric templates, which are typically employed for au- 
thentication purposes and are simply the encoding 
of a biometric [42], cannot be used securely in this 
setting. 


• The Key ($K$): A cryptographic key that is derived
from (or locked by) one or more biometric samples 
during an enrollment phase. The key may later be 
regenerated using another biometric sample that is 
“close” to the original samples and the template that 
was also output during enrollment. 


• Auxiliary Information ($A$): Auxiliary information
encompasses any public information not intended
to be used for key-derivation purposes but that is
still readily available to an adversary. Auxiliary
information is specified with respect to one user and
includes any biometric, template, or key other than
those associated with the user in question. It could
also include any other information about the environment
that might leak information about the biometric,
or results of using the key.


For the remainder of the paper if a component of a 
BKG is associated with a specific user, then we subscript 
the information with the user’s unique identifier. So, for 
example, $B_u$ is $u$’s biometric and $A_u$ is auxiliary information
derived from all users $u' \neq u$.


3.1 Evaluation Recommendations 


At a high level, the evaluation of a BKG requires design- 
ers to show that two properties hold: correctness and se- 
curity. Intuitively, a scheme that achieves correctness is 
one that is usable for a high percentage of the population. 
That is, the biometric of choice can be reliably extracted 
to within some threshold of tolerance, and when com- 
bined with the template the correct key is output with 
high probability. As correctness is well understood, and 
is always presented when discussing the feasibility of a 
proposed BKG, we do not address it further. 

In the context of biometric key generation, security is 
not as easily defined as correctness. Loosely speaking, 
a secure BKG outputs a key that “looks random” to any 
adversary that cannot guess the biometric. In addition, 
the templates and keys derived by the BKG should not 
leak any information about the biometric that was used 
to create them. We enumerate a set of three security 
requirements for biometric key generators, and examine 
the components that should be analyzed mathematically 
(i.e., the template and key) and empirically (i.e., the bio- 
metric and auxiliary information). While the necessity of 
the first two requirements has been understood to some 
degree, we will highlight and analyze how previous eval- 
uations of these requirements are lacking. Additionally, 
we discuss a requirement that is often overlooked in the 
practical literature, but one which we believe is necessary 
for a secure and practical BKG. 

We consider a BKG secure if it meets the following 
three requirements for each enrollable user in a popula- 
tion: 


• Key Randomness (REQ-KR): The keys output by a
BKG appear random to any adversary who has access
to auxiliary information and the template used
to derive the key. For instance, we might require
that the key be statistically or computationally
indistinguishable from random.


• Weak Biometric Privacy (REQ-WBP): An adversary
learns no useful information about a biometric
given auxiliary information and the template used
to derive the key. For instance, no computationally
bounded adversary should be able to compute any
function of the biometric.


• Strong Biometric Privacy (REQ-SBP): An adversary
learns no useful information about a biometric
given auxiliary information, the template used to
derive the key, and the key itself. For instance, no
computationally bounded adversary should be able
to compute any function of the biometric.


The necessity of REQ-KR and REQ-WBP is well 
known, and indeed many proposals make some sort 
of effort to argue security along these lines (see, e.g., 
[29, 11]). However, many different approaches are used 
to make these arguments. Some take a cryptographi- 
cally formal approach, whereas others provide an empiri- 
cal evaluation aimed at demonstrating that the biometrics 
and the generated keys have high entropy. Unfortunately, 
the level of rigor can vary between works, and differ- 
ences in the ways REQ-KR and REQ-WBP are typically 
argued make it difficult to compare approaches. Also, 
it is not always clear that the empirical assumptions re- 
quired by the cryptographic algorithms of the BKG can 
be met in practice. 

Even more problematic is that many approaches for 
demonstrating biometric security merely provide some 
sort of measure of entropy of a biometric (or key) based 
on variation across a population. For example, one com- 
mon approach is to compute biometric features for each 
user in a population, and compute the entropy over the 
output of these features. However, such analyses are 
generally lacking on two counts. For one, if the corre- 
lation between features is not accounted for, the reported 
entropy of the scheme being evaluated could be much 
higher than what an adversary must overcome in prac- 
tice. Second, such techniques fail to compute entropy 
as a function of the biometric templates, which we ar- 
gue should be assumed to be publicly available. Con- 
sequently, such calculations would declare a BKG “secure”
even if, say, the template leaked information about
the derived key. For example, suppose that a BKG uses 
only one feature and simply quantizes the feature space, 
outputting as a key the region of the feature space that 
contains the majority of the measurements of a specific 
user’s feature. The quantization is likely to vary between 
users, and so the partitioning information would need to 
be stored in each user’s template. Possession of the tem- 
plate thus reduces the set of possible keys, as it defines 
how the feature space is partitioned. 

As far as we know, the notion of Strong Biometric 
Privacy (REQ-SBP) has only been considered recently, 
and only in a theoretical setting [12]. Even the origi- 
nal definitions of fuzzy extractors [11, Definition 3] do 
not explicitly address this requirement. Unfortunately, 
REQ-SBP has also largely been ignored by the designers 
of practical systems. Perhaps this oversight is due to lack 
of perceived practical motivation; it is not immediately
clear that a key could be used to reveal a user’s biometric.
Indeed, to our knowledge, there have been few, if
any, concrete attacks that have used keys and templates 
to infer a user’s biometric. We observe, however, that it 
is precisely practical situations that motivate such a re- 
quirement; keys output by a BKG could be revealed for 
any number of reasons in real systems (e.g., side-channel 
attacks against encryption keys, or the verification of a 
MAC). If a key can be used to derive the biometric that 
was used to generate the key, then key recovery poses 
a severe privacy concern. Moreover, key compromise 
would then preclude a user from using this biometric in 
a BKG ever again, as the adversary would be able to 
recreate any key the user makes thereafter. Therefore, 
in Section 7 we provide specific practical motivation for 
this requirement by describing an attack against a well- 
accepted BKG. The attack combines the key and a tem- 
plate to infer a user’s biometric. 

In what follows, we provide practical motivation for 
the importance of each of our three requirements by an- 
alyzing three published BKGs. It is not our goal to fault 
specific constructions, but instead to critique evaluation 
techniques that have become standard practice in the 
field. We chose to analyze these specific BKGs because 
each was argued to be secure using “standard” tech- 
niques. However, we show that since these techniques do 
not address important requirements, each of these constructions
exhibits significant weaknesses despite security
arguments to the contrary. 


4 Biometrics and “Entropy” 


Before continuing further, we note that analyzing the se- 
curity of a biometric key generator is a challenging task. 
A comprehensive approach to biometric security should 
consider sources of auxiliary information, as well as the 
impact of human forgers. Though it may seem imprac- 
tical to consider the latter as a potential threat to a stan- 
dard key generator, skilled humans can be used to gener- 
ate initial forgeries that an algorithmic approach can then 
leverage to undermine the security of the BKG. 

To this point, research has accepted this “adversar- 
ial multiplicity” without examining the consequences in 
great detail. Many works (e.g., [33, 29, 17, 15, 40, 16]) 
report both False Accept Rates (i.e., how often a human 
can forge a biometric) and an estimate of key entropy 
(i.e., the supposed difficulty an algorithm must over- 
come in order to guess a key) without specifically iden- 
tifying the intended adversary. In this work, we focus 
on algorithmic adversaries given their importance in of- 
fline guessing attacks, and because we have already ad- 
dressed the importance of considering human-aided forg- 
eries [4, 5]. While our previous work did not address
biometric key generators specifically, those lessons apply
equally to this case.

The security of biometric key generators in the face of 
algorithmic adversaries has been argued in several dif- 
ferent ways, and each approach has advantages and dis- 
advantages. Theoretical approaches (e.g., [11, 6]) be- 
gin by assuming that the biometrics have high adversar- 
ial min-entropy (i.e., conditioned on all the auxiliary in- 
formation available to an adversary, the entropy of the 
biometric is still high) and then proceed to distill this 
entropy into a key that is statistically close to uniform. 
However, in practice, it is not always clear how to esti- 
mate the uncertainty of a biometric. In more practical 
settings, guessing entropy [25] has been used to mea- 
sure the strength of keys (e.g., [29, 27, 10]), as it is 
easily computed from empirical data. Unfortunately, as 
we demonstrate shortly, guessing entropy is a summary 
statistic and can thus yield misleading results when com- 
puted over skewed distributions. Yet another common 
approach (e.g., [31, 7, 16, 41, 17]), which has led to
somewhat misleading views on security, is to argue key 
strength by computing the Shannon entropy of the key 
distribution over a population. More precisely, if we consider
a BKG that assigns the key $K_u$ to a user $u$ in a
population $P$, then it is considered “secure” if the entropy of
the distribution $P(K) = |\{u \in P : K_u = K\}| / |P|$ is
high. We note, however, that the entropy of the previous
distribution measures only key uniqueness and says nothing
about how difficult it is for an adversary to guess the
key. In fact, it is not difficult to design BKGs that output
keys with maximum entropy in the previous sense, but
whose keys are easy for an adversary to guess; setting
$K_u = u$ is a trivial example.
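
The trivial example can be made concrete with a short Python sketch. It is an illustration only: the function name `population_key_entropy` and the eight-user population are hypothetical, and the code simply evaluates the Shannon entropy of the key distribution described above.

```python
import math
from collections import Counter

def population_key_entropy(keys_by_user):
    """Shannon entropy (bits) of P(K) = |{u : K_u = K}| / |P|."""
    counts = Counter(keys_by_user.values())
    total = len(keys_by_user)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical population of 8 users whose "key" is simply the user id (K_u = u).
keys = {u: u for u in range(8)}
entropy_bits = population_key_entropy(keys)

# Every key is unique, so the entropy is maximal (log2(8) bits), yet an
# adversary who knows the target's identifier guesses the key in one attempt.
guesses_needed = 1
```

The measure rewards uniqueness across users, which is why it cannot distinguish this degenerate generator from a secure one.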

To address these issues, we present a new measure that 
is easy to compute empirically and that estimates the dif- 
ficulty an adversary will have in guessing the output of 
a distribution given some related auxiliary distribution. 
It can be used to empirically estimate the entropy of a 
biometric for any adversary that assumes the biomet- 
ric is distributed similarly to the auxiliary distribution. 
Our proposition, Guessing Distance, involves determin- 
ing the number of guesses that an adversary must make 
to identify a biometric or key, and how the number of 
guesses is reduced in light of various forms of auxiliary
information. 


4.1 Guessing Distance 


We assume that a specific user $u$ induces a distribution
$U$ over a finite, $n$-element set $\Omega$. We also assume that
an adversary has access to population statistics that also
induce a distribution, $P$, over $\Omega$. $P$ could be computed
from the distributions of other users $u' \neq u$. We seek to
quantify how useful $P$ is at predicting $U$. The specification
of $\Omega$ varies depending on the BKG being analyzed;
$\Omega$ could be a set of biometrics, a set of possible feature
outputs, or a set of keys. It is up to system designers to
use the specification of $\Omega$ that would most likely be used
by an adversary. For instance, if the outputs of features
are easier to guess than a biometric, then $\Omega$ should be
defined as the set of possible feature outputs. Although at
this point we keep the definition of $P$ and $U$ abstract, it is
important when assessing the security of a construction
to take as much auxiliary information as possible into
account when estimating $P$. We return in Section 5 with an
example of such an analysis.

We desire a measure that estimates the number of
guesses that an adversary will make to find the
high-probability elements of $U$, when guessing by enumerating
$\Omega$ starting with the most likely elements as prescribed
by $P$. That is, our measure need not precisely capture the
distance between $U$ and $P$ (as might, say, $L_1$-distance
or Relative Entropy), but rather must capture simply $P$’s
ability to predict the most likely elements as described by
$U$.¹ Given a user’s distribution $U$, and two (potentially
different) population distributions $P_1$ and $P_2$, we would
like the distance between $U$ and $P_1$ and between $U$ and
$P_2$ to be the same if and only if $P_1$ and $P_2$ prescribe the
same guessing strategy for a random variable distributed
according to $U$. For example, consider the distributions $U$,
$P_1$ and $P_2$, and the element $\omega \in \Omega$ such that $P_1(\omega) = .9$,
$P_2(\omega) = .8$, and $U(\omega) = 1$. Here, an adversary with access
to $P_1$ would require the same number of guesses to
find $\omega$ as an adversary with access to $P_2$ (one). Thus, we
would like the distance between $U$ and $P_1$ and between
$U$ and $P_2$ to be the same.


Guessing Distance. Let $\omega^* = \arg\max_{\omega \in \Omega} U(\omega)$. Let
$L_P = (\omega_1, \ldots, \omega_n)$ be the elements of $\Omega$ ordered such
that $P(\omega_i) \geq P(\omega_{i+1})$ for all $i \in [1, n-1]$. Define
$t^-$ and $t^+$ to be the smallest index and largest index $i$
such that $|P(\omega_i) - P(\omega^*)| \leq \delta$. The Guessing Distance
between $U$ and $P$ with tolerance $\delta$ is defined as:


$$GD_\delta(U, P) = \log \frac{t^- + t^+}{2}$$





Guessing Distance measures the number of guesses that
an adversary who assumes that $U \approx P$ makes before
guessing the most likely element as prescribed by $U$ (that
is, $\omega^*$).² We take the average over $t^-$ and $t^+$ as it may
be the case that several elements have similar probability
masses under $P$. In such a situation, the ordering
of $L_P$ may be ambiguous, so we report an average measure
across all equivalent orderings. As $U$ and $P$ will
typically be empirical estimates, we use a tolerance $\delta$ to
offset small measurement errors when grouping elements
of similar probability masses. The subscript $\delta$ is ignored
if $\delta = 0$.

Discussion. Intuitively, one can see that this definition
makes sense by considering the following three cases:
(1) $P$ is a good indicator of $U$ (i.e., $\omega^* = \omega_1$); (2)
$P$ is uniform; and (3) $P$ is a poor indicator of $U$ (i.e.,
$\omega^* = \omega_n$). In case (1) the adversary clearly benefits
from using $P$ to guess $\omega^*$, and this relation is captured
as $GD(U, P) = \log 1 = 0$. In case (2), the adversary
learns no information about $U$ from $P$ and thus would
be expected to search half of $\Omega$ before guessing the correct
value; indeed $GD(U, P) = \log \frac{n+1}{2}$. Finally, in case
(3), a search algorithm based on $P$ would need to enumerate
all of $\Omega$ before finding $\omega^*$, and this is reflected by
$GD(U, P) = \log n = \log |\Omega|$.
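
The three cases can be checked numerically. The following Python sketch is one possible reading of the definition, assuming base-2 logarithms and distributions given as dictionaries over the same set; the function name `guessing_distance` and the four-element example are hypothetical.

```python
import math

def guessing_distance(U, P, delta=0.0):
    """GD_delta(U, P) = log2((t_minus + t_plus) / 2).

    U and P map each element of Omega to a probability.
    """
    # omega* is the most likely element under the user's distribution U.
    omega_star = max(U, key=U.get)
    # L_P: elements of Omega in non-increasing order of probability under P.
    ordered = sorted(P, key=P.get, reverse=True)
    # 1-based indices whose mass under P is within delta of P(omega*).
    close = [i for i, w in enumerate(ordered, start=1)
             if abs(P[w] - P[omega_star]) <= delta]
    t_minus, t_plus = close[0], close[-1]
    return math.log2((t_minus + t_plus) / 2)

# Hypothetical user distribution over Omega = {a, b, c, d}; omega* = "a".
U = {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}
P_good = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}        # case (1)
P_uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}  # case (2)
P_poor = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.4}         # case (3)

gd_good = guessing_distance(U, P_good)        # log2(1)
gd_uniform = guessing_distance(U, P_uniform)  # log2((1 + 4) / 2)
gd_poor = guessing_distance(U, P_poor)        # log2(4)
```

With empirical estimates, a small positive `delta` would group elements whose estimated masses differ only by measurement noise.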

An important characteristic of GD is that it compares 
two probability distributions. This allows for a more 
fine-tuned evaluation as one can compute GD for each 
user in the population. To see the overall strength of a 
proposed approach, one might report a CDF of the GD’s 
for each user, or report the minimum over all GD’s in the 
population. 

Guessing Distance is superficially similar to Guessing 
Entropy [25], which is commonly used to compute the 
expected number of guesses it takes to find an average 
element in a set assuming an optimal guessing strategy 
(i.e., first guessing the element with the highest likeli- 
hood, followed by guessing the element with the second 
highest likelihood, etc.). Indeed, one might view Guessing
Distance as an extension of Guessing Entropy (see
Appendix A); however, we prefer Guessing Distance as 
a measure of security as it provides more information 
about non-uniform distributions over a key space. For 
such distributions, Guessing Entropy is increased by the 
elements that have a low probability, and thus might not 
provide as conservative an estimate of security as de- 
sired. Guessing Distance, on the other hand, can be com- 
puted for each user, which brings to light the insecurity 
afforded by a non-uniform distribution. We provide a 
concrete example of such a case in Appendix A. 





5 The Impact of Public Information on 
Key Randomness 


We now show why templates play a crucial role in the 
computation of key entropy (REQ-KR from Section 3). 
Our analysis brings to light two points: first that tem- 
plates, and in particular, error-correction information, 
can indeed leak a substantial amount of information 
about a key, and thus must be considered when com- 
puting key entropy. Second, we show how standard ap- 
proaches to computing key entropy, even if they were 
to take templates into account, must be conducted with 
care to avoid common pitfalls. Through our analysis we 
demonstrate the flexibility and utility of Guessing Distance.
While we focus here on a specific proposal by
Vielhauer and Steinmetz [40, 41], we argue that our results
are generally applicable to a host of similar proposals
(see, e.g., [44, 7, 35, 17]) that use per-user feature-space
quantization for error correction. Per-user quantization
complicates the calculation of entropy and brings to light
common pitfalls.


The construction works as follows. Given 50 features
$\phi_1, \ldots, \phi_{50}$ [40] that map biometric samples to
the set of non-negative integers, and $\ell$ enrollment samples
$B_1, \ldots, B_\ell$, let $\Delta_i$ be the difference between the
minimum and maximum value of $\phi_i(B_1), \ldots, \phi_i(B_\ell)$,
expanded by a small tolerance. The scheme partitions
the output range of $\phi_i$ into $\Delta_i$-length segments. The
key is derived by letting $L_i$ be the smallest integer in
the segment that contains the user’s samples, computing
$\Gamma_i = L_i \bmod \Delta_i$, and setting the $i$-th key element
$c_i = \lfloor (L_i - \Gamma_i) / \Delta_i \rfloor$. The key is $K = c_1 \| \cdots \| c_{50}$,
and the template $T$ is composed of
$\{(\Delta_1, \Gamma_1), \ldots, (\Delta_{50}, \Gamma_{50})\}$.
To later extract $K$ given a biometric sample $B'$ and a
template $T$, set $c'_i = \lfloor (\phi_i(B') - \Gamma_i) / \Delta_i \rfloor$ and output
$K' = c'_1 \| \cdots \| c'_{50}$. We refer the reader to [41] for details on
correctness.
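
The quantization step can be sketched in Python as follows. This is a simplified illustration rather than the implementation of [40, 41]: it uses two features instead of 50, the way the tolerance widens each segment is an assumption, and the names `enroll` and `keygen` are hypothetical.

```python
def enroll(feature_values, tolerance=1):
    """feature_values[i] holds the integer outputs of feature i over the
    enrollment samples. Returns (template, key)."""
    template, key = [], []
    for values in feature_values:
        # L_i: smallest integer of the segment holding the user's samples
        # (assumed here to be the minimum, lowered by the tolerance).
        lower = max(min(values) - tolerance, 0)
        # Delta_i: spread of the samples, expanded by the tolerance.
        delta = max(values) + tolerance - lower + 1
        # Gamma_i = L_i mod Delta_i is the offset of the segment grid.
        gamma = lower % delta
        template.append((delta, gamma))
        # c_i = floor((L_i - Gamma_i) / Delta_i)
        key.append((lower - gamma) // delta)
    return template, tuple(key)

def keygen(template, sample_features):
    """c'_i = floor((phi_i(B') - Gamma_i) / Delta_i) for each feature."""
    return tuple((phi - gamma) // delta
                 for phi, (delta, gamma) in zip(sample_features, template))

# Hypothetical enrollment data: three samples, two features.
template, key = enroll([[52, 55, 53], [200, 210, 204]])
same_user_key = keygen(template, (54, 207))  # inside both segments
other_key = keygen(template, (60, 207))      # first feature falls outside
```

Note that the template fixes both the segment width and the grid offset, so anyone holding it knows exactly how every feature value maps to a key element.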


As is the case in many other proposals, Vielhauer et al.
perform an analysis that addresses requirement REQ-KR
by arguing that, because the template leaks only error-correcting
information (i.e., the partitioning of the feature
space), it does not indicate the values $c_i$. To support
this argument, they conduct an empirical evaluation to
measure the Shannon entropy of each $c_i$. For each user
$u$ they derive $K_u$ from $T_u$ and $B_u$, then compute the
entropy of each element $c_i$ across all users. This analysis
is a standard estimate of entropy. To see why this is
inaccurate, consider two different users $a$ and $b$ such that
$a$ outputs consistent values on feature $\phi$ and $b$ does not.
Then the partitioning over $\phi$’s range differs for each user.
Thus, even if the mean value of $\phi$ is the same when measured
over both $a$’s and $b$’s samples, this mean will be
mapped to different partitions in the feature space, and
thus, a different key. This implies that computing entropy
over the $c_i$ overestimates security because the mapping
induced by the templates artificially amplifies the entropy
of the biometrics. A more realistic estimate of the utility
afforded an adversary by auxiliary information can
be achieved by fixing a user’s template, and using that
template to error-correct every other user’s samples to
generate a list of keys, then measuring how close those
keys are to the target user’s key. By conditioning the estimate
on the target user’s template we are able to eliminate
the artificial inflation of entropy and provide a better
estimate of the security afforded by the construction.

Analysis. We implemented the construction and tested
the technique using all of the passphrases in the data
set we collected in [3], which consists of over 9,000
writing samples from 47 users. Each user wrote the
same five passphrases 10-20 times. In our analysis we
follow the standard approach to isolate the entropy
associated with the biometric: we compute various entropy
measures using each user’s rendering of the same
passphrase [29, 5, 36] (this approach is justified as
user-selected passphrases are assumed to have low entropy).
Tolerance values were set such that the approach
achieved a False Reject Rate (FRR) of 0.1% (as reported
in [40]) and all outliers and samples from users who
failed to enroll [24] were removed.


Figure 1 shows three different measures of key uncertainty.
The first measure, denoted Standard, is the common
measure of interpersonal variation as reported in the
literature (e.g., [17, 7]) using the data from our experiments.
Namely, if the key element $c_i$ has entropy $H_i$
across the entire population, then the entropy of the key
space is computed as $H = \sum_{i=1}^{50} H_i$. We also show
two estimates of guessing distance. The first ($GD(U, P)$,
plotted as GD-P) does not take a target user’s template
into account, and $P$ is just the distribution over all other
users’ keys in the population (the techniques we use to
compute these estimates are described in Appendix B).
The second ($GD(U, P[T_u])$, plotted as GD-U) takes the
user’s template into account, computing $P[T_u]$ by taking
the biometrics from all other users in the population, and
generating keys using $T_u$, then computing the distribution
over these keys.
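
A rough Python sketch of the template-conditioned distribution follows. It is not the estimator of Appendix B: it weights every sample equally and reports only the rank of the target key, and the names `template_conditioned_distribution` and `guess_rank`, the template values, and the sample data are all hypothetical.

```python
from collections import Counter

def keygen(template, features):
    """Quantize each feature output with the (Delta_i, Gamma_i) pairs."""
    return tuple((phi - gamma) // delta
                 for phi, (delta, gamma) in zip(features, template))

def template_conditioned_distribution(target_template, other_users_samples):
    """Estimate P[T_u]: the keys obtained when every other user's samples
    are error-corrected with the target user's template."""
    counts = Counter(keygen(target_template, sample)
                     for samples in other_users_samples.values()
                     for sample in samples)
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}

def guess_rank(distribution, target_key):
    """1-based position of target_key when keys are guessed in order of
    decreasing probability; None if the key never appears."""
    ordered = sorted(distribution, key=distribution.get, reverse=True)
    return ordered.index(target_key) + 1 if target_key in ordered else None

# Hypothetical target template (two features) and the target's key.
target_template = [(6, 3), (13, 4)]
target_key = (8, 15)
others = {
    "a": [(53, 205), (56, 201)],
    "b": [(40, 230)],
    "c": [(54, 209)],
}
dist = template_conditioned_distribution(target_template, others)
rank = guess_rank(dist, target_key)
```

Because the target's template coarsens the feature space, samples from several different users collapse onto the target's key, which is the effect the GD-U curve captures.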

Figure 1 shows the CDF of the number of guesses
that one would expect an adversary to make to find each
user’s key. There are several important points to take
away from these results. The first concerns the common pitfalls
associated with computing key entropy. The difference
between $GD(U, P)$ and the standard measurement indicates
that the standard measurement of entropy (43 bits
in this case) is clearly misleading: under this view one
would expect an adversary to make $2^{42}$ guesses on average
before finding a key. However, from $GD(U, P)$ it
is clear that an adversary can do substantially better than
this. The difference in estimates is due to the fact that
GD takes into account conditional information between
features whereas a more standard measure does not.
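
The effect of ignoring dependence between features can be seen on a toy example. The Python sketch below uses a hypothetical population of four keys with two perfectly correlated elements; it compares the sum of per-element entropies (the Standard measure) with the entropy of the keys taken as a whole.

```python
import math
from collections import Counter

def entropy(values):
    """Empirical Shannon entropy (bits) of a list of hashable values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Hypothetical keys in which the second element always equals the first.
keys = [(0, 0), (1, 1), (2, 2), (3, 3)]

# Standard measure: sum the entropy of each key element separately.
sum_of_marginals = sum(entropy([k[i] for k in keys]) for i in range(2))

# Entropy of the keys themselves, which accounts for the correlation.
joint = entropy(keys)
```

Here the summed estimate is twice the joint entropy, since the second element adds no uncertainty once the first is known.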

The second point is the impact of a user’s template 
on computing GD. We can see by examining $GD(U, P)$
that if we take the usual approach of just computing en- 
tropy over the keys, and ignore each user’s template, we 
would assume only a small probability of guessing a key 
in fewer than $2^{21}$ attempts. On the other hand, since the
templates reduce the possible key space for each user, the 
estimate $GD(U, P[T_u])$ provides a more realistic measurement.
In fact, an adversary with access to population

Figure 1: CDF of the guesses required by an adversary to find a key. We compare the Standard metric to two estimates
of GD, one that uses the target user’s template (GD-U), and one that uses each individual user’s template (GD-P).


statistics has a 50% chance of guessing a user’s key in 
fewer than $2^{22}$ attempts, and a 15.5% chance of guessing a
key in a single attempt! 

These results also shed light on another pitfall worth 
mentioning—namely, that of reporting an average case 
estimate of key strength. If we take the target user’s tem- 
plate into account in the current construction, 15.5% of 
the keys can be guessed in one attempt despite the estimated
Guessing Entropy being approximately $2^{22}$. In
summary, this analysis highlights the importance of con- 
ditioning entropy estimates on publicly available tem- 
plates, and how several common entropy measures can 
result in misleading estimates of security. 


6 The Impact of Public Information on 
Weak Biometric Privacy 


Recall that a scheme that achieves Weak Biometric Pri- 
vacy uses templates that do not leak information about 
the biometrics input during enrollment. A standard ap- 
proach to arguing that a scheme achieves REQ-WBP is 
to show (1) auxiliary information leaks little useful in- 
formation about the biometrics, and (2) templates do not 
leak information about a biometric. This can be problem- 
atic as the two steps are generally performed in isolation. 
In our description of REQ-WBP, however, we argue that 
step (2) should actually show that an adversary with access
to both templates and auxiliary information
learns no information about the biometric. The key difference
here is that auxiliary information is used in both
steps (1) and (2). This is essential as it is not difficult to
create templates that are secure when considered in isolation,
but are insecure once we consider knowledge derived
from other users (e.g., population-wide statistics).
In what follows we shed light on this important consid- 
eration by examining the scheme of Hao and Wah [17]. 
While our analysis focuses on their construction, it is per- 
tinent to any BKG that stores partial information about 
the biometric in the template [43, 26]. 


For completeness, we briefly review the construction. 
The BKG generates DSA signing keys from n dynamic 
features associated with handwriting (e.g., pen tip veloc- 
ity or writing time). The range of each feature is quan- 
tized based on a user’s natural variation over the feature. 
Each partition of a feature’s range is assigned a unique 
integer; let $p_i$ be the integer that corresponds to the partition
containing the output of feature $\phi_i$ when applied
to the user’s biometric. The signing key is computed as
$K = \mathrm{SHA1}(p_1 \| \cdots \| p_n)$. The template stores information
that describes the partitions for each feature, as well
as the (x,y) coordinates that define the pen strokes of 
the enrollment samples, and the verification key corre- 
sponding to K. The (x, y) coordinates of the enrollment 
samples are used as input to the Dynamic Time Warp- 
ing [32] algorithm during subsequent key generation; if 
the provided sample diverges too greatly from the orig- 
inal samples, it is immediately rejected and key genera- 
tion aborted. 
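
The key-derivation step can be sketched in Python as below. This is a hedged illustration, not the implementation of [17]: the Dynamic Time Warping check and DSA key-pair generation are omitted, partitions are represented as sorted cut points, and the byte encoding of $p_1 \| \cdots \| p_n$ as well as the names `partition_index` and `derive_key` are assumptions.

```python
import bisect
import hashlib

def partition_index(value, boundaries):
    """Integer label of the partition containing value, where boundaries
    is a sorted list of cut points for one feature."""
    return bisect.bisect_right(boundaries, value)

def derive_key(feature_outputs, partitions):
    """Hash the concatenated partition labels, K = SHA1(p_1 || ... || p_n).

    The "||"-separated ASCII encoding is an assumption made for this sketch.
    """
    labels = [partition_index(v, b)
              for v, b in zip(feature_outputs, partitions)]
    encoded = "||".join(str(p) for p in labels).encode("ascii")
    return hashlib.sha1(encoded).digest()

# Hypothetical per-user partitions for two dynamic features
# (e.g., pen-tip velocity and writing time).
partitions = [[10.0, 20.0, 30.0], [0.5, 1.0]]

key_enroll = derive_key([12.3, 0.7], partitions)
key_retry = derive_key([19.9, 0.9], partitions)  # same partitions, same key
key_other = derive_key([21.0, 0.7], partitions)  # first feature moves
```

Only the partition labels enter the hash, so any sample whose feature outputs land in the same partitions reproduces the key.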


Hao et al. performed a typical analysis of
REQ-WBP [17]. First, they compute the entropy
of the features over the entire population of users to
show that auxiliary information leaks little information
that could be used to discern the biometric. Second, the
template is argued to be secure by making the following
three observations. First, since the template only
specifies the partitioning of the range of each feature, the
template only leaks the variation in each feature, not the
output. Second, for a computationally bounded adversary,
a DSA verification key leaks no information about
the DSA signing key. Third, since the BKG employs
only dynamic features, the static (x, y) coordinates
leak no relevant information. Note that in this analysis
the template is analyzed without considering auxiliary
information. Unfortunately, while by themselves the
auxiliary information and the templates seem to be
of little use to an adversary, when taken together, the
biometric can be easily recovered.


Figure 2: Search results against the BKG proposed by Hao et al. [17]. Our search algorithm has a 22% chance of
finding a user’s key on the first guess.


Analysis. To demonstrate this, we apply the techniques
of [3] to generate guesses of the user’s biometric samples.
In [3] we describe a set of statistical measures that
can be computed using population statistics, and map
these spatial measures³ to the most likely pen speed.
In that work we assume limited knowledge of the target 
user’s biometric, and compose static samples from the 
user to create a partial forgery, then infer timing informa- 
tion to make a complete forgery. In the current approach, 
we need not assume access to the target user’s biometric 
because the (x, y) coordinates of the enrollment samples
are stored in the template. Thus, we apply our approach 
from [3], to make a guess at the user’s biometric. Then, 
we use an intelligent search algorithm that enumerates 
other biometrics that are “close” to the first guess. The 
algorithm focuses the bulk of its work searching for the 
outputs of the features that exhibit high variance across
the population, and reduces the search space by exploiting
conditioning between features.

To empirically evaluate our attack, we used the same 
data set as in Section 5. Our implementation of the BKG 
had a FRR of 29.2% and a False Accept Rate (FAR) of 
1.7%, which is in line with the FRR/FAR of 28%/1.2%
reported in [17]. Moreover, if we follow the computation 
of inter-personal variation as described in [17], then we 
would incorrectly conclude that the scheme creates keys 
with over 40 bits of entropy with our data set, which is 
the same estimate provided in [17]. However, this is not 
the case (see Figure 2). In particular, the fact that the 
template leaks information about the biometric enables 
an attack that successfully recreates the key 22% of the 
time on the first try; approximately 50% of the keys are 
correctly identified after making fewer than $2^{15}$ guesses.
In summary, the significance of this analysis does not lie 
in the effectiveness of the described attack, but more so 
in the fact that the original analysis failed to take auxil- 
iary information into consideration when evaluating the 
security of the template. 


7 The Impact of Key Compromise on 
Strong Biometric Privacy 


Lastly, we highlight the importance of quantifying the 
privacy of a user’s biometric against adversaries who 
have access to the cryptographic key (i.e., REQ-SBP 
from Section 3). We examine a BKG proposed by Hao et 
al. [16]. The construction generates a random key and 
then “locks” it with a user’s iris code. The construction
uses a cryptographic hash function $h : \{0,1\}^* \to \{0,1\}^s$
and a “concatenated” error correction code consisting of
an encoding algorithm $C : \{0,1\}^{140} \to \{0,1\}^{2048}$, and
the corresponding decoding algorithm $D : \{0,1\}^{2048} \to \{0,1\}^{140}$.
This error correction code is the composition
of a Reed-Solomon and Hadamard code [16, Section 3].
Iris codes are elements in $\{0,1\}^{2048}$ [8].

The BKG works as follows: given a user’s iris code $B$,
select a random string $K \in \{0,1\}^{140}$, derive the template
$T = (h(K), B \oplus C(K))$, and output $T$ and $K$. To
later derive the key given an iris code $B'$ and the template
$T = (t_1, t_2)$, compute $K' = D(t_2 \oplus B')$. If $h(K') = t_1$,
then output $K'$; otherwise, fail. If $B$ and $B'$ are “close”
to one another, then $t_2 \oplus B'$ is “close” to $C(K)$, perhaps
differing in only a few bits. The error correcting code
handles these errors, yielding $K' = K$.

Hao et al. provide a security analysis arguing require- 
ment REQ-KR using both cryptographic reasoning and a 
standard estimate of entropy of the input biometric. That 
is, they provide empirical evidence that auxiliary infor- 
mation cannot be used to guess a target user’s biometric, 
and a cryptographic argument that, assuming the former, 
the template and auxiliary information cannot be used to 
guess a key. They conservatively estimate the entropy 
of K to be 44 bits. Moreover, the authors note that if 
the key is ever compromised, the system can be used to 
“lock” a new key, since K is selected at random and is 
not a function of the biometric. 

Unfortunately, given the current construction, compromise
of $K$, in addition to the public information
$T = (t_1, t_2)$, allows one to completely reconstruct
$B = C(K) \oplus t_2$. Thus, even if a user were to create a
new template and key pair, an adversary could use the
old template and key to derive the biometric, and then
use the biometric to unlock the new key. The significance
of this is worth restating: because this BKG fails
to meet REQ-SBP, the privacy of a user’s biometric is
completely undermined once any key for that user is ever
compromised.
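
The following sketch illustrates the construction and the recovery of the biometric from a compromised key. It is a toy model under stated assumptions: a 5-fold repetition code stands in for the Reed-Solomon/Hadamard code of [16], SHA-256 stands in for $h$, the key is 16 bits rather than 140, and all function names are hypothetical.

```python
import hashlib
import secrets

R = 5  # repetition factor of the stand-in error correction code (assumption)

def encode(key_bits):
    """C: repeat every key bit R times."""
    return [b for b in key_bits for _ in range(R)]

def decode(code_bits):
    """D: majority vote over each block of R bits."""
    blocks = [code_bits[i:i + R] for i in range(0, len(code_bits), R)]
    return [1 if sum(block) > R // 2 else 0 for block in blocks]

def xor(a, b):
    return [x ^ y for x, y in zip(a, b)]

def h(bits):
    return hashlib.sha256(bytes(bits)).hexdigest()

def enroll(biometric, key_len):
    """Pick a random K and output (K, T) with T = (h(K), B xor C(K))."""
    key = [secrets.randbelow(2) for _ in range(key_len)]
    return key, (h(key), xor(biometric, encode(key)))

def regenerate(sample, template):
    """Compute K' = D(t2 xor B') and release it only if h(K') = t1."""
    t1, t2 = template
    candidate = decode(xor(t2, sample))
    return candidate if h(candidate) == t1 else None

def recover_biometric(key, template):
    """The attack: B = C(K) xor t2, given a compromised K and public T."""
    _, t2 = template
    return xor(encode(key), t2)

B = [secrets.randbelow(2) for _ in range(16 * R)]
K, T = enroll(B, 16)
noisy = B.copy()
noisy[0] ^= 1          # one flipped bit in block 0
noisy[7] ^= 1          # one flipped bit in block 1
K_new, T_new = enroll(B, 16)   # user re-enrolls after K is compromised
```

In this model, `regenerate(noisy, T)` returns `K`, `recover_biometric(K, T)` returns `B` exactly, and feeding the recovered biometric to `regenerate` with `T_new` unlocks `K_new`.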


8 Conclusion 


In this paper, we examine a series of requirements, pit- 
falls, and subtleties that are commonly overlooked in the 
evaluation of biometric key generators. Our goal is to 
encourage rigorous empirical evaluations that consider 
the impact of publicly available data to show that a BKG 
(I.) ensures the privacy of a user’s biometric, and (II.)
outputs keys that are suitable for cryptographic applications.
Our exposition brings to the forefront practical
ways of thinking about existing requirements that help 
elucidate subtle nuances that are commonly overlooked 
with regard to biometric security. As we demonstrate,
failure to consider these requirements may result in estimates
that overstate the security of proposed schemes.

To underscore the practical significance of each of 
these requirements, we present analyses of three published
systems. While we point out weaknesses in specific
constructions, it is not our goal to fault those specific
works. Instead, we aim to bring to light flaws in the
standard approaches that were followed in each setting. 
In one case we exploit auxiliary information to show that 
an attacker can guess 15% of the keys on her first attempt. 
In another case, we highlight the importance of ensuring 
biometric privacy by exploiting the information leaked 
by templates to yield a 22% chance of guessing a user’s 
key in one attempt. Lastly, we show that subtleties in 
BKG design can lead to flaws that allow an adversary to 
derive a user’s biometric given a compromised key and 
template, thereby completely undermining the security 
of the scheme. 

We hope that our work encourages designers and eval- 
uators to analyze BKGs with a degree of skepticism, and 
to question claims of security that overlook the require- 
ments presented herein. To facilitate this type of ap- 
proach, we not only ensure that our requirements can 
be applied to real systems, but also introduce Guessing 
Distance—a heuristic measure that estimates the uncer- 
tainty of the outputs of a BKG given access to population 
statistics.


Notes


¹Typically, $U$ is computed over error-corrected values, and so the
most likely element will also be the only element that has any
probability mass.

²We note that Guessing Distance is not a distance metric as it does
not necessarily satisfy symmetry or the triangle inequality.

³These measures can be reproduced given the (x, y) coordinates of
handwriting.

⁴This BKG is technically an instance of the fuzzy commitment
proposed by Juels and Wattenberg [21], which was later shown to be an
instance of a secure sketch [11].


A Guessing Distance and Guessing Entropy


Guessing Entropy [25] is commonly used for measuring 
the expected number of guesses it takes to find an average 
element in a set assuming an optimal guessing strategy 
(i.e., first guessing the element with the highest likeli- 
hood, followed by guessing the element with the second 
highest likelihood, etc.). Given a distribution $P$ over $\Omega$
and the convention that $P(\omega_i) \ge P(\omega_{i+1})$, Guessing
Entropy is computed as $G(P) = \sum_{i=1}^{n} i \, P(\omega_i)$.
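
As a small illustration of this definition, the sketch below computes Guessing Entropy for an explicit list of probabilities. The function name and the two example distributions are assumptions chosen for illustration; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

def guessing_entropy(probabilities):
    """Expected number of guesses when elements are tried in order of
    decreasing probability: the sum of rank * probability."""
    ordered = sorted(probabilities, reverse=True)
    return sum(rank * p for rank, p in enumerate(ordered, start=1))

# Uniform distribution over four elements.
uniform = [Fraction(1, 4)] * 4
# Skewed distribution: one element with mass 1/2, nine sharing the rest.
skewed = [Fraction(1, 2)] + [Fraction(1, 18)] * 9

print(guessing_entropy(uniform))  # 5/2
print(guessing_entropy(skewed))   # 7/2
```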

Guessing Entropy is commonly used to determine how 
many guesses an adversary will take to guess a key. At 
first, Guessing Entropy and Guessing Distance appear to 
be quite similar. However, there is one important dif- 
ference: Guessing Entropy is a summary statistic and 
Guessing Distance is not. While Guessing Entropy provides
an intuitive and accurate estimate over distributions
that are close to uniform, the fact that there is one mea- 
sure of strength for all users in the population may result 
in somewhat misleading results when Guessing Entropy 
is computed over skewed distributions. 

To see why this is the case, consider the following
distribution: let $P$ be defined over $\Omega = \{\omega_1, \ldots, \omega_n\}$ as
$P(\omega_1) = \frac{1}{2}$, and $P(\omega_i) = \frac{1}{2(n-1)}$ for $i \in [2, n]$. That is,
one element (or key) is output by 50% of the users and
the remaining elements are output with equal likelihood.
The Guessing Entropy of P is:
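
Working only from the definition of $G(P)$ and the distribution stated above (mass $\frac{1}{2}$ on $\omega_1$ and $\frac{1}{2(n-1)}$ on each remaining element), a short derivation gives:

\[
G(P) \;=\; 1 \cdot \frac{1}{2} + \sum_{i=2}^{n} \frac{i}{2(n-1)}
\;=\; \frac{1}{2} + \frac{1}{2(n-1)} \left( \frac{n(n+1)}{2} - 1 \right)
\;=\; \frac{1}{2} + \frac{(n+2)(n-1)}{4(n-1)}
\;=\; \frac{n}{4} + 1 .
\]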
Thus, although the expected number of guesses to correctly
select $\omega$ is approximately $n/4$, over half of the
population’s keys are correctly guessed on the first attempt
following the optimal strategy. To contrast this, consider
an analysis of Guessing Distance with threshold 6 = x: 
(Assume for exposition that distributions are estimated 
from a population of $N = 2(n-1)$ users.) To do so, evaluate
each user in the population independently. Given a
population of users, first remove a user to compute $U$ and
use the remaining users to compute $P$. Repeat this process
for the entire population.

In the case of our pathological distribution, we may
consider only two users without loss of generality: a user
with distribution $U_1$ who outputs key $\omega_1$, and a user with
distribution $U_2$ who outputs key $\omega_2$. In the first case, we
have $GD_\delta(U_1, P) = \log 1 = 0$, because the majority of
the mass according to $P$ is assigned to $\omega_1$, which is the
most likely element according to $U_1$. For $U_2$, we have
$t^- = 2$ and $t^+ = n$, and thus $GD_\delta(U_2, P) = \log \frac{2+n}{2}$.
Taking the minimum value (or even reporting a CDF)
shows that for a large proportion of the population (all
users with distribution $U_1$), this distribution offers no
security, a fact that is immediately lost if we only consider
a summary statistic. However, it is comforting to
note that if we compute the average of $2^{GD_\delta}$ over all users,
we obtain estimates that are identical to that of guessing
entropy for sets that are sufficiently large:
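
Under the idealized two-user analysis above, half of the users contribute $2^0 = 1$ and the other half contribute $\frac{n+2}{2}$, so the average is $\frac{1}{2} + \frac{n+2}{4} = \frac{n}{4} + 1$, the same value that the definition of Guessing Entropy yields for this distribution. The sketch below checks this numerically. It makes several assumptions: the threshold is set to $1/N$ with $N = 2(n-1)$, leave-one-out estimation is ignored, each user's distribution places all mass on one key, and the function name is hypothetical.

```python
from fractions import Fraction
from math import isclose, log2

def guessing_distance(user, population, delta):
    """GD_delta(U, P) = log2((t_minus + t_plus) / 2), where t_minus and
    t_plus are the smallest and largest ranks (in decreasing order of P)
    whose probability is within delta of P(w*), and w* is the most
    likely element under U."""
    ordered = sorted(population, key=lambda w: -population[w])
    target = max(user, key=user.get)
    close = [rank for rank, w in enumerate(ordered, start=1)
             if abs(population[w] - population[target]) <= delta]
    return log2((close[0] + close[-1]) / 2)

n = 10
P = {1: Fraction(1, 2)}
P.update({i: Fraction(1, 2 * (n - 1)) for i in range(2, n + 1)})
delta = Fraction(1, 2 * (n - 1))  # assumed threshold, 1/N

gd_1 = guessing_distance({1: 1}, P, delta)  # user who outputs key 1
gd_2 = guessing_distance({2: 1}, P, delta)  # user who outputs key 2

# Half of the users behave like user 1, the rest like user 2.
average = 0.5 * 2 ** gd_1 + 0.5 * 2 ** gd_2
```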
B Estimating GD


As noted in Section 5, it is difficult to obtain a meaning- 
ful estimate of probability distributions over large sets, 
e.g., $\mathbb{N}^{50}$. In order to quantify the security defined by a
system, it is necessary to find techniques to derive meaningful
estimates. This Appendix discusses how we estimate
GD. The estimate also implicitly defines an algorithm
that can be used to guess keys.

For convenience we use $\phi$ to denote both a biometric
feature and the random variable that is defined using
population statistics over $\phi$ (taken over the set $\Omega_\phi$). If
a distribution is not subscripted, it is understood to be
taken over the key space $\Omega = \Omega_{\phi_1} \times \cdots \times \Omega_{\phi_n}$. Our
estimate uses several tools from information theory:


Entropy. The entropy of a random variable $X$ defined
over the set $\Omega$ is

\[ H(X) = -\sum_{\omega \in \Omega} \Pr[X = \omega] \log \Pr[X = \omega] \]


Mutual Information. The amount of information
shared between two random variables $X$ and $Y$ defined
over the domains $\Omega_X$ and $\Omega_Y$ is measured as

\[ I(X;Y) = \sum_{x \in \Omega_X} \sum_{y \in \Omega_Y} \Pr[X = x \wedge Y = y] \log \frac{\Pr[X = x \wedge Y = y]}{\Pr[X = x] \Pr[Y = y]} \]

We use the notation $I(X; Y, Z)$ to denote the mutual
information between the random variable $X$ and the random
variable defined by the joint distribution between
the random variables $Y$ and $Z$.
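
Both quantities are straightforward to compute from explicit distributions. The sketch below is illustrative only: it assumes base-2 logarithms (so results are in bits), takes distributions as dictionaries, and uses hypothetical function names. For $I(X; Y, Z)$, the second coordinate of each joint key can simply be a pair (y, z).

```python
from math import log2

def entropy(dist):
    """H(X) = -sum_w Pr[X = w] * log2 Pr[X = w]."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def mutual_information(joint):
    """I(X;Y) from a joint distribution given as {(x, y): probability}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p   # marginal of X
        py[y] = py.get(y, 0.0) + p   # marginal of Y
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

fair_bit = {0: 0.5, 1: 0.5}
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}
identical = {(0, 0): 0.5, (1, 1): 0.5}
```

Here `entropy(fair_bit)` is 1 bit, two independent fair bits share 0 bits of information, and two identical fair bits share 1 bit.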


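The Estimate. Let $GD_\delta(U_{\phi_i}, P_{\phi_i} \mid u_{i-1}, \ldots, u_1)$ be the
guessing distance between the user’s and population’s
distribution over $\phi_i$ conditioned on the event that
$\phi_{i-1} = u_{i-1}, \ldots, \phi_1 = u_1$. In particular, let
$L_{P_{\phi_i}} = (\omega_1, \ldots, \omega_n)$ be the elements of $\Omega_{\phi_i}$, ordered such that

\[ P_{\phi_i}(\omega_j \mid \phi_{i-1} = u_{i-1}, \ldots, \phi_1 = u_1) \ge P_{\phi_i}(\omega_{j+1} \mid \phi_{i-1} = u_{i-1}, \ldots, \phi_1 = u_1). \]

As before, let $\omega^* = \operatorname{argmax}_{\omega \in \Omega_{\phi_i}} U_{\phi_i}(\omega)$, and $t^-$ and
$t^+$ be the smallest and largest indexes $j$ such that

\[ \left| P_{\phi_i}(\omega_j \mid \phi_{i-1} = u_{i-1}, \ldots, \phi_1 = u_1) - P_{\phi_i}(\omega^* \mid \phi_{i-1} = u_{i-1}, \ldots, \phi_1 = u_1) \right| \le \delta. \]

Then, $GD_\delta(U_{\phi_i}, P_{\phi_i} \mid u_{i-1}, \ldots, u_1) = \log(t^- + t^+) - 1$.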
In other words, if an adversary assumes that a target user 
is distributed according to the population and fixes the 
values of certain features, this is the number of guesses 
she will need to make to guess another feature. Unfortu- 
nately, this quantity is also infeasible to compute in light 
of data constraints so we endeavor to find an easily
computable estimate. To this end, define the weight ($d_i$) of
an element $\omega \in \Omega_{\phi_i}$ as:





i pee a 


dj(w | us_a,...,U1) = 
i-1 i-1 
The weights of elements that are more likely to occur
given the values of other features will be larger than the
weights of elements that are less likely to occur. Intuitively, each of
the values ($u_j$) has an influence on $d_i(\omega)$ and those values
that correspond to features that have a higher correlation
with $\phi_i$ have more influence. We also note that we only
use two levels of conditional probabilities, which are rel- 
atively easy to compute, instead of conditioning over the 
entire space. Now, we use the weights to estimate the 
probability distributions as: 


\[ P_{\phi_i}(\omega_j \mid u_{i-1}, \ldots, u_1) \approx \frac{d_i(\omega_j \mid u_{i-1}, \ldots, u_1)}{\sum_{\omega} d_i(\omega \mid u_{i-1}, \ldots, u_1)} \]
Note that while this technique may not provide a perfect
estimate of each probability, our goal is to discover the 
relative magnitude of the probabilities because they will 
be used to estimate Guessing Distance. We believe that 
this approach achieves this goal. 

We are almost ready to provide an estimate of GD. 
First, we specify an ordering for the features. The ordering
will be according to an ordering measure ($M(\phi)$)
such that features with a larger measure have a low entropy
(and are therefore easier to guess) and have a high
correlation with other features. An adversary could then 
use this ordering to reduce the number of guesses in a 
search by first guessing features with a higher measure. 
Define the feature-ordering measure for $\phi_i$ as:


H(gj)\ AO? 
Mo= 2, (0+ FG) 
Finally, we reindex the features such that $M(\phi_i) \ge
M(\phi_{i+1})$ for all $i \in [1, 50]$, and estimate the guessing
distance for a specific user with $\phi_i = u_i$ as:


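\[ GD_\delta(U, P) = \log \left[ 1 + \sum_{i=1}^{50} \left( 2^{GD_\delta(U_{\phi_i}, P_{\phi_i} \mid u_{i-1}, \ldots, u_1)} - 1 \right) \prod_{j=i+1}^{50} \left| \Omega_{\phi_j} \right| \right] \]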


This estimate helps in modeling an adversary that performs
a brute-force search over all of the features by
starting with the features that are easiest to guess and
using those features to reduce the uncertainty about features
that are more difficult to guess. For each feature,
the adversary will need to make $2^{GD_\delta(U_{\phi_i}, P_{\phi_i} \mid u_{i-1}, \ldots, u_1)}$
guesses to find the correct value. Since each incorrect
guess ($2^{GD_\delta(U_{\phi_i}, P_{\phi_i} \mid u_{i-1}, \ldots, u_1)} - 1$ of them) will cause a
fruitless enumeration of the rest of the features, we multiply
the number of incorrect guesses by the sizes of the
ranges of the remaining features. Finally, we take the log
to represent the number of guesses as bits.
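
The aggregation step described in this paragraph can be sketched as follows. The sketch assumes the per-feature conditional guessing distances have already been estimated and that features are already in search order; the function name and the example inputs are hypothetical.

```python
from math import log2, prod

def estimate_gd(per_feature_gd, range_sizes):
    """Combine per-feature guessing distances (in bits) into one estimate.

    per_feature_gd[i]: guessing distance for feature i given the
                       features guessed before it.
    range_sizes[i]:    number of possible values of feature i.
    """
    total = 1.0
    for i, gd in enumerate(per_feature_gd):
        wrong_guesses = 2 ** gd - 1             # incorrect values tried
        remaining = prod(range_sizes[i + 1:])   # wasted enumeration per miss
        total += wrong_guesses * remaining
    return log2(total)

# Every feature guessed correctly on the first try: a single guess, 0 bits.
print(estimate_gd([0, 0, 0], [4, 4, 4]))   # 0.0
```

With `per_feature_gd = [1, 0]` and `range_sizes = [2, 4]`, one wrong guess on the first feature wastes an enumeration of the 4 values of the second, giving log2(1 + 4) bits.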

Section 5 uses this estimation technique to measure
GD of a user versus the population ($GD(U, P)$), and for
a user versus the population conditioned on the user’s
template ($GD(U, P[T_u])$). The only way in which the
estimation technique differs between the two settings is the
definition of $P_{\phi_i}$. In the case of $GD(U, P)$, $P_{\phi_i}$ is
computed by measuring the $i$th key element for every other
user in the population. In the case of $GD(U, P[T_u])$, $P_{\phi_i}$
is computed using all of the other users’ samples in
conjunction with the target user’s template to derive a set of
keys and taking the distribution over the $i$th element of
the keys.
Abstract


We explore the problem of secret-key distribution in 
unidirectional channels, those in which a sender transmits 
information blindly to a receiver. We consider two ap- 
proaches: (1) Key sharing across space, i.e., via simultane- 
ously emitted values that may follow different data paths 
and (2) Key sharing across time, i.e., in temporally staggered
emissions. Our constructions are of general interest,
treating, for instance, the basic problem of constructing
highly compact secret shares. Our main motivating
problem, however, is practical key management in RFID 
(Radio-Frequency IDentification) systems. We describe 
the application of our techniques to RFID-enabled supply 
chains and a prototype privacy-enhancing system. 


1 Introduction 


Key management is a cornerstone of cryptography, but 
also its major deployment challenge. Textbook crypto- 
graphic protocols often presuppose keys held by a pair of 
principals anecdotally dubbed Alice and Bob. From birth, 
Alice and Bob are presumed to share a password, a secret 
key, or the public key of some mutually trusted entity. 

In practice, the conceptually simple goals of key 
distribution—even between two parties—are fraught with 
complexity. Disparate naming conventions and require- 
ments for key revocation and recovery have hobbled many 
public-key infrastructures. Password management re- 
mains a widespread challenge thanks to obstacles as var- 
ied as limited human memory, caps-lock keys, and social- 
engineering attacks such as phishing. 

Ultimately, key distribution must rely on secure chan- 
nels established through pre-existing trust relationships or 
special physical considerations. For example, browser 
software shipped with new computing systems carries the 
root public keys of a number of certificate authorities.
Special physical assumptions and adversarial constraints can
shape the problem of key distribution in interesting ways. 
Researchers have explored various physical models to sup- 
port key establishment between pairs of devices, including 
optical channels [16,24], distance-bounding [30] based on 
signal velocity, and physical contact [33]. Such models 
treat a variety of adversarial capabilities. For instance, 
privacy amplification [3], which strengthens keys using 
shared sources of noise or quantum phenomena, appeals 
to bounds on adversarial data access or storage. 

In this paper, we focus on the problem of key distri- 
bution between two parties communicating via a unidi- 
rectional channel. This special constraint means that one 
party (Alice) acts exclusively as a sender, while the other 
(Bob) acts exclusively as a receiver. We consider the chal- 
lenge of unidirectional key transport when Alice and Bob 
have no pre-existing relationship, but share a channel with 
limited adversarial access. We believe that such special 
unidirectional models have broad applicability, as they re- 
flect the natural broadcast characteristics of many media. 
The starting point and motivation for our investigation, 
though, is the specific, real-world problem of key trans- 
port in RFID-enabled supply chains. 


Organization In Section 2, we give details on the RFID 
challenges motivating our work. We provide an overview 
of our technical contributions in Section 3 and review re- 
lated work in Section 4. In Section 5, we present what we 
call secret sharing in space, a key-distribution system that 
supports privacy protection in RFID applications. We also 
briefly describe a prototype RFID implementation of se- 
cret sharing in space. In Section 6, we present secret shar- 
ing in time, a separate body of techniques applicable to 
RFID access-control and authentication, and also of broad 
interest for key distribution in unidirectional channels. We 
conclude in Section 7 with a brief discussion of future re- 
search directions. 
2 Motivation: The RFID Landscape


The ratio of terrestrial radio and cellular telephone sys- 
tems to the number of humans on earth is approaching 
unity, and in the past decade, a completely different kind 
of radio device has emerged and is poised to eclipse this 
ratio by three orders of magnitude. Rapid advances in 
CMOS technology have enabled the production of low- 
cost tags that are capable of reporting their identity over 
a wireless link. These tags—usually costing tens of cents 
and carrying a few thousand gates of silicon—have little 
if any general-purpose computing power beyond what is 
needed to respond to commands from an interrogator or 
reader. This asymmetry between interrogators and tags 
is further amplified by the fact that, in many applications, 
tags are passive, lacking an on-board source of power; in- 
stead, they harvest power from the electric, magnetic or 
electromagnetic field generated by the interrogators. 

Recent developments in passive Radio Frequency IDentification
(RFID) technology and corresponding international
standards [12] have spurred deployment in applications
ranging from supply-chain and inventory management
of consumer goods, to tracking medical equipment
in hospitals, to counting poker chips on gaming tables. 

The heir apparent to the optical barcode, RFID is be- 
coming a prevalent technology in supply-chain manage- 
ment. Ultimately, manufacturers and retailers envisage 
RFID tagging of individual consumer items. Today, tag- 
ging is most common at the granularity of cases, which 
contain consumer items, and of pallets, which carry cases. 
In this paper, we use the term “case” as the generic term 
for a discrete collection of goods. 

For supply-chain operations, the predominant RFID 
standard is one known as the Electronic Product Code 
(EPC) (in particular, Class-1 Gen-2 EPC, hereafter re- 
ferred to as Gen2). EPC tags act effectively as wireless 
barcodes, emitting short strings of information known as 
EPC codes. An EPC code has four basic components: (1) 
A header, which denotes the EPC version number; (2) A 
domain manager, which typically specifies the manufac- 
turer or creator of the item; (3) An object class, which 
specifies the item type, and (4) a serial number, a unique 
identifier for the item. This license plate approach associates
an arbitrary amount of metadata with the tagged object
while requiring little memory on the tag itself.


2.1 Security and Key Distribution in Gen2 


Two features in the Gen2 standard require secret keys: 
Locking and perma-locking: It is possible to lock part 
(or all) of the tag’s memory, either temporarily under a 
32-bit password, or permanently with no possibility of un- 
locking and rewriting the memory. While this feature pre- 
vents unauthorized entities from tampering with the con- 
tents of tag memory, it does not prevent unauthorized read- 
ers from reading the contents. 
The kill command: The only security function that completely
disables tags is a command known as kill. When
transmitted by a reader along with a tag-specific kill PIN 
(32 bits long in Gen2), the kill command causes a tag to 
disable itself permanently. 

The EPC kill function is envisaged as a privacy- 
enhancing feature for retail environments with item-level 
tagging. EPC tags specify the items to which they are af- 
fixed. Thus a consumer carrying EPC-tagged items would 
in principle be subject to clandestine inventorying attacks 
that disclose sensitive data about medications, reading ma- 
terials, luxury goods, and so forth. By deploying the 
kill function at the point of sale, a retail shop can pro- 
tect against such privacy infringements by disabling tags. 
Additionally, researchers have proposed anti-cloning tech- 
niques that co-opt the kill and write-access commands in 
EPC to support reader authentication of tags and to protect 
PINs from untrusted readers [15]. 

Both locking and killing pose a significant implementa- 
tion hurdle: They require a solution to the key-distribution 
problem. The initialization of tag-specific kill PINs in 
tags and the secure propagation of these PINs to point-of- 
sale devices are formidable operational challenges. Sup- 
ply chains include entities with widely disparate data- 
processing capabilities. Information transfer across orga- 
nizational boundaries, moreover, introduces a host of reg- 
ulatory and technical burdens. Hence supply-chain entities 
commonly lack data-network mechanisms for timely, reli- 
able, and secure transport of PINs. While it might seem a 
straightforward matter for Alice (a manufacturer) to share 
EPC PINs with Bob (a retailer) through a data network, in 
practice it is often quite difficult. Indeed, with all of the in- 
termediaries through which manufactured goods regularly 
pass, Alice may even ship cases without knowing that Bob 
is the ultimate receiver. 

In this paper, we show that RFID-enabled supply chains 
possess unique properties that allow us to: 


• Provide consumer privacy with respect to unauthorized
scanning of tagged objects;


• Provide a robust protocol-independent mechanism to
distribute PINs and passwords without requiring a
network connection, changes to the air interface protocol,
or changes to the tag hardware.


The only resource our method requires is memory on 
the tag, and we provide a means to trade off memory usage
against security. 


2.2 Object Hierarchies in RFID-Enabled 
Supply Chains 


Our techniques for key distribution in RFID applications 
rely in part on the fact that supply chains are hierarchical 
in nature. To highlight the properties we utilize, we use 
Figure 1 to trace the path of a single pack of razor blades
in a consumer’s home back to the manufacturing facility. 
Figure 1: Object hierarchies in RFID-enabled supply chains. This schematic represents the path taken by an individual pack of
razor blades from the factory to the consumer’s home. Please refer to Section 2.2 for details.


Typically, items start off in large collections and pro- 
gressively get whittled down into smaller aggregates 
as they make their way from the factory to the store 
shelf [13]. In the example above, razor blades are as- 
sembled into a pallet containing 90 cases, each with 72 
packs of blades. Assuming the items, cases, and pallet 
are tagged, we have a total of 6571 tags on this partic- 
ular pallet. The pallet is then transported, possibly with 
many other pallets, to a distribution center (DC). The DC 
de-palletizes the large pallet and assembles a mixed pallet 
with a smaller quantity of cases that has been ordered by 
the store. A typical number of cases from the original pal- 
let that make it onto this new pallet is 10 [13]. Assuming 
a new pallet tag is added, 730 of the 6571 original tags are
now available on the new pallet. This new pallet is then 
transported to the store and stored in the backroom. Of 
these 730 tags, typically up to two cases’ worth, or 144
items, are laid out on the store shelf for customers. From
this collection, consumers pick up a few packs and pur- 
chase them. Therefore, the object hierarchy is as follows. 

Razor blades: 6571 → 730 → 144 → 5
Similarly, for DVDs a typical object hierarchy is 

DVDs: 5040 → 2520 → 400 → 24
where the last number represents an estimate of the num- 
ber of DVDs from a case sold to an individual consumer. 
Finally, for pharmaceuticals, we have 

Pharmaceuticals: 7200 → 1920 → 150 → 6
where again the last number represents an estimate of the 
maximum number of filled prescriptions from one case in 
possession of a consumer at the same time. 
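
The razor-blade tag counts quoted above follow from simple arithmetic on the stated pallet layout (90 cases of 72 packs, 10 cases on the mixed pallet, two cases on the shelf). The short check below uses only those figures; the variable names are illustrative.

```python
ITEMS_PER_CASE = 72

# Full pallet: 90 cases of items, plus one tag per case, plus the pallet tag.
pallet_tags = 90 * ITEMS_PER_CASE + 90 + 1

# Mixed pallet: 10 of the original cases (item tags plus case tags);
# the newly added pallet tag is not one of the original tags.
mixed_pallet_tags = 10 * ITEMS_PER_CASE + 10

# Store shelf: up to two cases' worth of items.
shelf_items = 2 * ITEMS_PER_CASE

print(pallet_tags, mixed_pallet_tags, shelf_items)  # 6571 730 144
```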

While these numbers may vary between different types 
of retailers and use cases, the important point to note is 
that the number of tagged items starts off large and ends 
up being small. Another important insight is that larger 
numbers of tags are typically found in physically secure 
areas, while smaller numbers of tags are found in physical 
locations that are accessible to adversaries. We exploit the
fact that tags share the same space-time context earlier in 
the supply chain, but this history is progressively lost as 
tagged objects emerge from the supply chain into the front 
of the retail store and thereon into the consumer’s home. 


3 Our Contribution 


The challenges of EPC PIN distribution motivate us to 
consider a new approach, that of transporting secret keys 
in RFID tags themselves. This approach allows a unidirec- 
tional model of key transport. The sender (Alice) encodes 
secrets across tags or cases. The receiver (Bob) recov- 
ers these secrets without communicating with Alice—and, 
potentially, without even knowing her identity. 

To support this unidirectional model of key transport, 
we propose protocols for dispersing keys or PINs across 
tags by means of secret sharing. We consider two distinct 
modes of secret sharing: (1) Secret sharing across space 
and (2) Secret sharing across time. 


Secret sharing across space: Alice can share a secret
key $\kappa$ across a set of tags $T = \{\tau_1, \ldots, \tau_n\}$ in a case. To do
so, she transforms $\kappa$ into a collection of shares $S_1, \ldots, S_n$,
and stores $S_i$ on tag $\tau_i$, such that $\kappa$ can only be recovered by
scanning all $n$ tags in the case. (We later consider threshold
secret sharing, i.e., schemes such that $k \le n$ shares suffice
for recovery of $\kappa$.)

Such secret sharing across tags permits a new approach 
to privacy enforcement for item-level tagging that largely 
eliminates the need for killing tags. Suppose that $m_i$ consists
of the data, e.g., EPC code, associated with tag $\tau_i$.
Suppose that Alice replaces $m_i$ with $E_\kappa[m_i]$ in all tags,
where $E_\kappa$ represents symmetric-key encryption under $\kappa$.
Then the contents $m_i$ of any tag can only be deciphered by
scanning the full set of tags $T$.
On receiving a case from Alice, a retailer (Bob) can
recover $\kappa$ and decrypt the EPC codes in its tags. Once
the items and their associated tags are dispersed by sale
to customers, however, a would-be eavesdropper has no
practical way to recover $\kappa$. We assume here that access to
tags is secured in the supply chain, i.e., the pre-sale envi- 
ronment. We illustrate the principle by example. 


Example 1 Alice ships a case containing three bottles 
of medicine bearing RFID tags %,%2 and 13 with data 
strings m,,m2, and m3. She generates a secret key « 
and transforms it into a triplet of shares (S,,S2,S3) via 
a (3,3)-secret sharing scheme. Alice writes the value 
v= (Ex [mj], Si) to tag T. 

Bob, a pharmacist, receives Alice’s case. He scans the
three tags, recovers κ and decrypts the data strings of the
tags in the case, enabling him to read m_1 = “High
street-value drug, 500 mg, 100 count, bottle #8278732,” as
well as m_2 and m_3. Bob dispenses the first bottle to Carol.

Later in the day, a drug thief surreptitiously scans
Carol’s RFID tags as she passes on the street. The thief
obtains the value v_1 = (E_κ[m_1], S_1), a ciphertext and key
share that by themselves carry no meaning and therefore
do not reveal the presence of high-value pharmaceuticals.
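
The (3,3) sharing in Example 1 can be illustrated with simple XOR-based splitting. This is only a sketch under stated assumptions: the text does not say which (3,3) scheme Alice uses, and the helper names `share_xor` and `recover_xor` are hypothetical.

```python
import secrets
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def share_xor(key: bytes, n: int) -> list[bytes]:
    """Split key into n shares; all n are required to rebuild it (assumed scheme)."""
    shares = [secrets.token_bytes(len(key)) for _ in range(n - 1)]
    # The last share is chosen so that the XOR of all n shares equals the key.
    shares.append(reduce(xor_bytes, shares, key))
    return shares


def recover_xor(shares: list[bytes]) -> bytes:
    """XOR all shares together; only meaningful when every share is present."""
    return reduce(xor_bytes, shares)


key = secrets.token_bytes(16)        # Alice's 128-bit key
s1, s2, s3 = share_xor(key, 3)       # one share per bottle tag
assert recover_xor([s1, s2, s3]) == key   # Bob, who scans the whole case
```

Each share in this sketch is as long as the key itself, which is exactly the size cost that the tiny-share constructions later in the paper aim to avoid.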


As this example illustrates, Bob does not have to per- 
form any explicit action to protect his customers’ privacy. 
He does not have to kill or rewrite tags. Secret sharing 
across space enforces privacy implicitly through the physical
dispersion of tags. Unlike killing, though, secret sharing
does not enforce privacy against tracking attacks. The
value v_1 is itself a unique identifier that can serve to
correlate different instances of scanning of Carol’s tags and
potentially track Carol herself. This is a basic limitation
of our scheme, but one we consider to be of considerably 
smaller importance than revelation of tag data contents. 

Of course, it is possible to encode κ in a case-specific
tag, rather than across items within a case. The advantage
of sharing across space is twofold, though: (1) as we
show, it allows for robust secret recovery, i.e., recovery of
κ even in the face of scanning errors or lost data, and (2) it
eliminates the need for an extra tag, i.e., one on each case.

Our main research challenge in applying secret sharing 
across space to RFID is the development of schemes with 
tiny secret shares. While the literature on computational 
secret sharing considers shares of length equal to that of 
a secret key, e.g., 128 bits, space constraints on EPC tags 
urge even smaller share sizes, e.g., 16 bits. 

In Example 1, the adversary (thief) is underinformed,
i.e., lacks the shares needed to recover κ. Another facet of
our research aims to create situations in which an adversary
is overinformed, having too many shares to identify
and extract tag keys. In Appendix A, we consider situations
in which an adversary is overinformed when scanning
retail shelves where the contents and thus RFID tags
of many cases are mixed together.

Secret sharing across time: Suppose that κ is not an
encryption key, but a write-access key. In that case, the
ability to recover κ by scanning a case would enable a
malefactor with access to a single case at any point in the
supply chain to modify the data contents of tags. Similarly,
suppose that κ were a symmetric key used to authenticate
tags. Then simply by scanning a case, an adversary could
recover all of the key material required to clone the
associated tags.


For this reason, we consider another form of secret sharing
in which a secret key κ is distributed not across the
tags in a single case, but across multiple cases. Given that
cases, much like data packets, depart and arrive at staggered
times in a supply chain, we refer to this approach as
secret sharing across time.


Example 2 Alice, a manufacturer, is shipping cases of 
RFID-tagged items to Bob. She would like to communi- 
cate the write-access PINs for the tags in these cases to 
Bob as securely as possible. 


Suppose that Alice employs trucks that hold up to ten
cases. She might do as follows. She selects a window,
i.e., sequence, of eleven cases c_j, c_{j+1}, ..., c_{j+10}
designated for delivery to Bob. She creates a master secret κ
from which it is possible to derive the write-access PIN for
any tag within the window of cases. She distributes κ into
eleven shares S_1, S_2, ..., S_11 via an (11,11)-secret sharing
scheme, and writes share S_d to case c_{j+d-1}. (She might
distribute the secret across tags on individual items, or on
a case-specific tag.)

An adversary that gains access to the contents of a small
collection of cases, or even an entire truckload, is unable
to reconstruct the secret κ or to obtain the write-access
PINs for the RFID tags. On the other hand, Bob can
reconstruct κ once he receives the full sequence of eleven
constituent cases.
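
As a rough illustration of Example 2, the sketch below splits a master secret into eleven case shares and derives a per-tag PIN from it. Everything specific here is an assumption made for illustration: the XOR-based (11,11) sharing, the HMAC-SHA256 derivation, the 32-bit PIN width, and the names `split_master` and `derive_pin` do not come from the text.

```python
import hashlib
import hmac
import secrets
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def split_master(master: bytes, n_cases: int) -> list[bytes]:
    """(n,n) sharing: one share per case, every case needed to rebuild (assumed)."""
    shares = [secrets.token_bytes(len(master)) for _ in range(n_cases - 1)]
    shares.append(reduce(xor_bytes, shares, master))
    return shares


def derive_pin(master: bytes, tag_id: bytes) -> int:
    """Hypothetical PIN derivation: first 32 bits of HMAC-SHA256(master, tag_id)."""
    mac = hmac.new(master, tag_id, hashlib.sha256).digest()
    return int.from_bytes(mac[:4], "big")


master = secrets.token_bytes(16)
case_shares = split_master(master, 11)             # share S_d travels with case c_{j+d-1}

full_window = reduce(xor_bytes, case_shares)       # Bob, after all eleven cases arrive
one_truck = reduce(xor_bytes, case_shares[:10])    # an adversary holding ten cases

assert full_window == master
# one_truck is a uniformly random string unrelated to the PINs, since the
# eleventh share acts as a one-time pad on the master secret.
```

Because a truck holds at most ten cases, no single truckload ever carries all eleven shares, which is the property the example relies on.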


Of course, in practice it may be difficult for Alice to
identify a priori a window of cases that a legitimate
receiver, Bob, will receive in its entirety, particularly if the
cases pass through intermediaries. Hence the main thrust
of our work here is the development of more flexible
secret sharing schemes. We propose what we call
Sliding-Window Information Secret-Sharing (SWISS) schemes,
constructions such that for a sequence c_1, c_2, ... of cases,
Bob need only receive a minimal number k of cases in
any contiguous window of size n in order to reconstruct
the associated secret keys. SWISS schemes provide key
confidentiality against adversaries that intercept cases on
a sporadic basis.


As we explain, it is a straightforward matter to create a
SWISS scheme in which shares are linear in n, and thus
potentially large in practice. Our contribution is a SWISS
scheme whose shares are constant in size, i.e., have length
independent of k and n.


4 Related Work


Since its invention in 1979 by Shamir [32] and indepen- 
dently by Blakley [4], secret sharing has played a foun- 
dational role in cryptography. However, our work differs 
from previous work in two key aspects: the privacy goal 
we adopt and the size of the shares employed. 

The majority of secret sharing literature evaluates the 
privacy of a secret-sharing scheme from an information- 
theoretic perspective, seeking to create efficient schemes 
for various access structures. In this regime, a perfect 
secret-sharing (PSS) scheme is one in which an adversary 
learns no information about the secret in an information- 
theoretic sense (i.e., even if the adversary has unbounded 
computational resources). Shamir’s scheme [32] qualifies 
as a PSS scheme. Statistical secret-sharing (SSS) schemes, 
such as Blakley’s [4], allow a small amount of information 
leakage, in the information-theoretic sense. 

A narrower literature concerns complexity (or com- 
putational) theoretic secret-sharing (CSS), in which pri- 
vacy depends on computational bounds on an adver- 
sary. Krawczyk first introduced the notion of a CSS 
scheme [20], and Bellare and Rogaway later refined and 
formalized it [2]. Work in this area has focused on privacy
based on all-or-nothing indistinguishability. In other
words, in Krawczyk’s construction, an adversary either
has no information about the secret or she has complete 
information about it. In this work, we introduce construc- 
tions that accommodate gradated key information. This 
allows us to consider schemes in which the leakage of se- 
cret information is proportional to and thus grows gradu- 
ally with the number of revealed shares. 

The other dimension in which this work differs from 
previous work is the length of the shares involved. It is 
well known that in any natural PSS scheme, the size of 
every participant’s share must be at least that of the se- 
cret itself [10, 18]. For specific access structures, stronger 
lower-bounds have been shown [9]. 

Any scheme in which shares are shorter than the secret 
is necessarily imperfect. Ogata and Kurosawa [26] give 
information-theoretic lower bounds on share sizes in such 
schemes. At a high level, they show that a share must 
have length equal to at least that of the “gap” in knowl- 
edge between sets of shares outside the permitted access 
structures and the secret itself. More formally, suppose
that a secret x ← D is selected at random from distribution
D. Let $\mathcal{X}$ denote a random variable for x and
$\mathcal{S}_i$ one for S_i, i.e., the i-th share generated by a
natural secret-sharing scheme. If Γ represents the set of
access structures that are allowed to recover the secret,
then it is the case that
$H(\mathcal{S}_j) \ge \min_{Y \notin \Gamma} H(\mathcal{X} \mid \{\mathcal{S}_i\}_{i \in Y})$,
where H(A|B) denotes the entropy of A conditional on B.

In terms of concrete proposals, in the information-theoretic
literature, McEliece and Sarwate note that
Shamir’s scheme can be generalized as a Reed-Solomon
code, permitting a tradeoff between share size and security
[25]. Blakley and Meadows propose a class of ramp
secret sharing schemes [5] which define two thresholds.
Given t shares, it is easy to reconstruct the secret. Fewer
than t' shares reveal no information about the secret, and
given some number of shares y such that t' < y < t, the
information gained about the secret is proportional to
(y − t')/(t − t'). Larger “ramps” provide weaker security but
allow a reduction in share size. In both of these proposals,
the size of the shares is dependent on the size of the secret.

By moving to the CSS realm, Krawczyk introduces a 
scheme with “short” shares with lengths independent of 
the secret’s size [20]. A cryptographic key is shared using 
a PSS scheme, while the secret is encrypted using the key. 
The resulting ciphertext is shared using an information- 
dispersal algorithm, e.g., Rabin’s IDA [27]. A share then 
consists of a cryptographic portion and a ciphertext por- 
tion. The cryptographic portion is at least as long as a 
cryptographic secret key plus a hash function image (thus, 
in practice, at least 384 bits). We use a similar mechanism 
to make the size of our shares independent of the secret, 
but in lieu of PSS and IDA schemes, we employ error cor- 
recting codes to reduce share sizes and add robustness. 

We are aware of no investigation, however, of the partic- 
ular problem of creating shares smaller than the short ones 
introduced by Krawczyk, i.e., shares potentially shorter
than a cryptographic secret key (perhaps 16 bits in length). 
Here, we characterize such shares as tiny. 

The omission from the literature of CSS schemes with 
tiny shares appears to have two underlying causes. First, 
short shares are compact enough for many applications. 
Second, the literature is solidly anchored in PSS. Even 
CSS schemes, such as that of Krawczyk, typically rely on 
PSS as a primitive to share out cryptographic keys. 


Secret-sharing in RFID: Langheinrich and Marti sug- 
gest using secret sharing to conceal an RFID tag’s infor- 
mation from adversaries with time-limited access to the 
tag [21]. The tag’s information is split using Shamir’s 
scheme [32], and the tag periodically emits a share. A 
reader that probes the tag over the course of several min- 
utes will receive enough shares to reconstruct the tag’s in- 
formation, while a casual attacker who only obtains a few 
emissions cannot reconstruct any tag information. Our 
schemes, in contrast, spread shares across multiple tags 
and consider sliding time windows with evolving secrets, 
rather than a single fixed secret. 

In other work, Langheinrich and Marti propose using 
Shamir’s scheme to distribute an item’s ID over hundreds 
of RFID tags integrated into the item’s material [22]. They 
aim to enforce privacy by requiring a reader to access mul- 
tiple tags. In contrast, we look to dispersion, rather than 
aggregation, of tags, as a privacy-enforcing mechanism. 
We also reduce the size of each share to well below the
size of standard Shamir shares.


5 Secret Sharing Across Space


Sharing a secret (e.g., a cryptographic key) across space in 
an RFID application imposes severe limitations on the size 
of each share. As discussed in Section 4, previous schemes 
typically require 128 bits or more for each share, whereas 
with RFID tags, we would like shares of 16 bits or less. 
Hence in this section we provide a generic robust secret 
sharing scheme that we refer to as a Tiny Secret Sharing 
(TSS) scheme. We define our scheme in a general problem 
framework based on adversarial games, describe a proto- 
type implementation, and suggest parameters appropriate 
for real-world deployment. 


5.1 Preliminaries 


Secret Sharing. We adhere closely to the notation
and definitions of Bellare and Rogaway [2]. An n-party
secret-sharing scheme is a pair of algorithms Π =
(Share, Recover) that operates over a message space X,
where:


• Share is a probabilistic algorithm that takes input x ∈ X
  and outputs the n-vector S = Share(x), where S_i ∈
  {0,1}*. On invalid input x ∉ X, Share outputs an
  n-vector of the special (“undefined”) symbol ⊥.


• Recover is a deterministic algorithm that takes input
  S ∈ ({0,1}* ∪ {◇})^n, where ◇ represents a share
  that has been erased (or is otherwise unavailable).
  The output Recover(S) ∈ X ∪ {⊥}, where ⊥ is a
  distinguished value indicating a recovery failure.


In our security definitions, we assume an honest dealer, 
i.e., correct execution of Share (although the adversary 
may choose the secret that is shared). 


Adversaries. While secret sharing literature traditionally
defines goals with respect to access structures, we
predicate our definitions below on a class 𝒜 of probabilistic
adversarial algorithms. We define the security of a TSS
scheme in terms of a particular class 𝒜. We can reconcile
our adversarial model with the traditional access-structure
view by restricting 𝒜 to only adversaries A that respect a
particular access structure. For example, we might consider
only adversaries that compromise fewer than d legitimate
shares for some d.


Error Correcting Codes. Our construction utilizes an
error-correcting code (ECC), a generalization of secret
sharing that we formally define as a pair of algorithms
Π^ECC = (Share^ECC, Recover^ECC). An (N,K,D)_Q-ECC operates
over an alphabet Σ of size |Σ| = Q. Share^ECC maps
Σ^K → Σ^N such that the minimum Hamming distance in
symbols between (valid) output vectors is D. For such
a function Share^ECC, there is a corresponding function
Recover^ECC that recovers a message successfully given an
attacker that can corrupt fewer than D/2 players or erase the
shares of D − 1 players, or some combination of the two,
depending on the specific ECC. (In some cases, correction
beyond the minimum distance is possible [28].)


5.2 Problem Definition 


Informally, the adversary may attack either the privacy or 
the robustness of the scheme or both. A privacy attacker 
attempts to recover the secret x given some number of 
shares. To break robustness, the adversary aims to cor- 
rupt shares such that Recover fails to output x. We define 
these security goals formally below and conclude with a 
definition of a TSS scheme. 


5.2.1 Privacy 


We consider two subtypes of privacy attackers: an under- 
informed adversary and an overinformed adversary. An 
underinformed adversary can corrupt a limited number of 
players, while an overinformed adversary can obtain all n 
shares, but also receives a number of additional “shares” 
that she cannot distinguish from the correct shares. Due to 
lack of space, we relegate details on overinformed adver- 
saries to Appendix A. (Briefly, an overinformed adversary 
sees shares from multiple cases simultaneously, and can- 
not feasibly extract secrets due to the hardness of decoding 
given many “chaff” shares.) 


Underinformed Attacks. Here, we consider an attacker
who obtains a limited number of legitimate shares (recall
Example 1). In this setting, Bellare and Rogaway define
privacy based on a notion of indistinguishability. Given
an n-party secret-sharing scheme (Π, X), they define the
oracle corrupt(S,i) as a function that returns S_i.
(“Corruption” in this setting, corresponding to compromise of a
share-holding player, results in disclosure, not change, of
a share.) Then the Bellare and Rogaway notion of privacy
is defined based on the experiment shown in Figure 2(a).
In the experiment, the adversary is asked to choose two
values to be shared. The experiment selects one of the
secrets at random and generates a set of shares. The adversary
can then corrupt (or see the value of) individual shares
and must eventually produce a guess as to which secret
was shared. The corruptions and the guess may be based
on state generated during the “choose” phase. Using this
experiment, Bellare and Rogaway define A’s advantage as


$$\mathrm{Adv}^{\mathrm{ind}}_{A}[\Pi, X] \triangleq 2\Pr\left[\mathrm{Exp}^{\mathrm{ind}}_{A}[\Pi, X] \Rightarrow 1\right] - 1.$$
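
The privacy game and the advantage it defines can be simulated for a toy scheme. The sketch below is an illustration under stated assumptions, not the TSS construction: it uses an (n,n) XOR sharing of 16-bit secrets as a stand-in, a fixed pair of challenge secrets, and a seeded pseudorandom generator so that runs are repeatable. The function names are hypothetical.

```python
import random
from functools import reduce


def share_xor(x: int, n: int, rng: random.Random, bits: int = 16) -> list[int]:
    """Stand-in (n,n) scheme: the XOR of all n shares equals the secret x."""
    parts = [rng.getrandbits(bits) for _ in range(n - 1)]
    parts.append(reduce(lambda a, b: a ^ b, parts, x))
    return parts


def ind_experiment(n: int, q: int, rng: random.Random) -> int:
    """One run of the privacy game; returns 1 if the adversary guesses b."""
    x0, x1 = 0x0000, 0xFFFF              # "choose" phase: two candidate secrets
    b = rng.randrange(2)                 # the experiment's hidden bit
    shares = share_xor((x0, x1)[b], n, rng)
    seen = shares[:q]                    # "corrupt" phase: q corrupt(S, i) queries
    if q == n:
        guess = 0 if reduce(lambda a, c: a ^ c, seen) == x0 else 1
    else:
        guess = rng.randrange(2)         # fewer than n XOR shares reveal nothing
    return int(guess == b)


def estimate_advantage(n: int, q: int, trials: int = 2000, seed: int = 1) -> float:
    """Empirical version of 2 * Pr[Exp => 1] - 1."""
    rng = random.Random(seed)
    wins = sum(ind_experiment(n, q, rng) for _ in range(trials))
    return 2 * wins / trials - 1


print(estimate_advantage(n=3, q=3))   # adversary sees every share
print(estimate_advantage(n=3, q=2))   # underinformed adversary
```

With all n shares the adversary always wins, so the estimate is exactly 1; with fewer shares it can only guess, and the estimate hovers near 0.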


5.2.2 Robustness 


We desire our scheme to allow a legitimate user to re- 
cover the original secret, even if the adversary tampers 
with some of the shares. To model a scheme’s resilience 
to such an attack, we define a robustness experiment. In
our robustness experiment, Share is invoked on a secret
x of the adversary’s choosing. The adversary then corrupts
a number of players and replaces their share values.
Again, the adversary is allowed to maintain state
between the “choose” and the “corrupt” phases. The adversary
is successful if Recover fails to recover x given
the corrupted and uncorrupted shares as input. This experiment
is much like that for robustness in Bellare and
Rogaway, though their definition additionally includes the
technical requirement that the adversary identify an
uncorrupted player i. This is not necessary for our purposes.
We define the robustness experiment as shown in
Figure 2(b), letting Ŝ represent the indices of the shares
corrupted by the adversary. We define the advantage of A
as Adv^rec_A[Π, X] = Pr[Exp^rec_A[Π, X] ⇒ 1].


Experiment Exp^ind_A[Π, X]
    (x_0, x_1) ← A(“choose”);
    b ←$ {0,1}; S ←$ Share(x_b);
    b' ← A^corrupt(S,·)(“corrupt”);
    output ‘1’ if b = b', else ‘0’

(a) Privacy Experiment


Experiment Exp^rec_A[Π, X]
    x ← A(“choose”);
    S ←$ Share(x);
    S' ← A^corrupt(S,·)(“corrupt”);
    x' ← Recover({S_i}_{i∉Ŝ} ∪ {S'_i}_{i∈Ŝ});
    output ‘1’ if x ≠ x', else ‘0’

(b) Robustness Experiment


Figure 2: TSS Experiments. These experiments capture our notion of privacy and robustness for TSS schemes.

It is also useful to consider a modified experiment
Exp^rec-or-detect_A that outputs ‘1’ if x ≠ x' and x' ≠ ⊥, else
‘0’. In other words, A is successful if it causes a recovery
failure that Recover does not detect. This is a weaker
requirement, of course, than that represented by Exp^rec_A, but
an important condition not explored by Bellare and Rogaway.
Given the above experiments, we define a TSS
scheme as follows.


5.2.3 TSS Definition 


Definition 1 A (k,n)-TSS scheme is a pair (Π, X), such
that Π distributes n shares of a secret x ∈ X, of which any
set of k correct shares suffices to recover x. The security of
the scheme is characterized by an adversary class 𝒜 and
the tuple (q_u, ε_u, q_r, ε_r), where an underinformed attacker
A ∈ 𝒜 making q_u corrupt queries has Adv^ind_A[Π, X] ≤ ε_u;
likewise, the pair (q_r, ε_r) applies to robustness attackers.
(An extended definition can include overinformed attackers
as well; see Appendix A.)


5.3 Our Construction


Figure 3 illustrates a high-level schematic of our TSS
scheme. The Share^TSS algorithm accepts as input an
arbitrarily-sized secret x. It then generates a large random
pre-key κ̄. We apply a hash to reduce κ̄ to the size
of a cryptographic key κ. The hash function also constitutes
good cryptographic hygiene (and is used in our
proofs) in the sense that it renders κ indistinguishable even
in the face of partial compromise of κ̄. We use the key κ
to perform authenticated encryption of x and then use an
(N,K,D)-error correcting code (ECC) to share both κ̄ and
the ciphertext. We focus in this paper on the basic
construction that assigns a single symbol to each share. Thus
we assume K = k. More general constructions are possible,
but omitted from this paper. A recipient with enough
shares can apply the ECC decoding algorithm to recover κ̄
and the ciphertext, and then use κ̄ to derive the key κ
necessary to authenticate and decrypt x. In some applications
(e.g., transporting the master key used to derive RFID kill
codes), we may only want to distribute a key. In that case,
we can use κ as the desired key, and eliminate the portion
of the schematic shown in the dashed box.

Figure 3: Secret Sharing with Tiny Shares. Schematic of our
TSS construction in a toy example with n=3. It can be used to
distribute a key κ, or optionally a secret x of arbitrary size. When
the pre-key κ̄ and x are provided at the same time, the two
error-correcting codes may be coalesced into a single one.

Our construction assumes that the hash function behaves
as a random oracle [1], and for large secrets, we
assume the use of an authenticated encryption mode, such
as OCB [29].

Below, we state our claims for the security of this
construction. We defer the proofs to Appendix B.


Claim 1 Given our construction above, an underinformed
attacker's advantage is bounded by ε_u, such that


$$\mathrm{Adv}^{\mathrm{ind}}_{A}[\Pi, X] \le \epsilon_u \le 1/Q^{\,k - q_u}.$$


Claim 2 Against an attacker who makes q_r corrupt
queries, if q_r < D/2, i.e., q_r ≤ ⌊(D−1)/2⌋, then
Adv^rec_A[Π, X] = 0 = ε_r, and if q_r ≤ D − 1, then
Adv^rec-or-detect_A[Π, X] = 0.


Thus, our construction is a (k,n)-TSS scheme with security
tuple (q_u, 1/Q^(k−q_u), ⌊(D−1)/2⌋, 0).


Remark 1 With an appropriate choice of an ECC, we can
generalize the construction above. For example, using a
systematic version of Reed-Solomon as the ECC, the pre-key
κ̄ will be encoded in the initial code symbols. We then apply
a hash function (SHA-256 with truncation) to those code symbols
to derive «. If we choose Q = 2I| (and do not release 
S®), then Share" becomes a robust PSS scheme, as in 
Krawczyk’s scheme [20]. If we choose Q = 2", then we 
have the scheme described above. Intermediate choices of 
Q trade increased share size for increased security. 


5.4 Implementation Sketch and Real World 
Parameterization 


We implemented a (15,20)-TSS scheme using a ThingMagic
Mercury5 reader and commercially-available Alien
Squiggle Gen2 tags. A schematic view of the setup is
shown in Figure 4. Use of a (15,20)-TSS scheme means
that of the 20 available tags, we need to read at least 15
tags successfully to recover the key and decrypt tag data.
We work over the field GF(2^16), so a share (codeword
symbol) is 16 bits. The Share algorithm was then implemented
as follows. We chose a secret key κ of length 128
bits; we obtained κ by choosing a random 240-bit pre-key
κ̄, hashing it with SHA-256, and then taking the first half
of the output. We then encoded κ̄ into 20 16-bit symbols
with a (20, 15) Reed-Solomon ECC using the built-in
Reed-Solomon encoder in Matlab’s Communications Toolbox.
This resulted in 20 16-bit shares, one for each tag.

Given that we were using 96-bit tags, we had 80 bits 
left over for the tag ID. This particular parametrization re- 
quires a cipher with an 80-bit block size. We achieve this 
by using the Blowfish block cipher [31], which has a block 
size of 64 bits, with Ciphertext-Stealing [11] to expand the 
block size to 80 bits. Integrity protection at the individual 
tag ID level is provided by the Gen2 protocol. 

Each tag ID m_i, 1 ≤ i ≤ 20, was then replaced by E_κ[m_i]
and concatenated with a share of the pre-key κ̄ (as generated
above). This combined 96-bit string was written into the tag
using the same setup (Figure 4). Because all Gen2 RFID
readers can also wirelessly write to tags, this operation is
accomplished by bringing each tag into the antenna field of
the reader and executing a Gen2 write command. In practice,
this operation would be carried out when the case,
pallet, or item tag is initially encoded in the supply chain.
Note that E_κ as used here includes Ciphertext-Stealing as
described above.

For the Recover algorithm, we simply unwound Share.
As shown in Figure 4, the reader sees encrypted tag IDs
with concatenated shares. As long as the reader sees at
least 15 tags, Recover running on the PC outputs the tag
IDs successfully.
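
The Share and Recover steps of the (15,20) prototype can be approximated in a few lines. This is a simplified sketch and not the Matlab implementation: it assumes a Reed-Solomon-style code built by polynomial interpolation over the prime field GF(65521) in place of GF(2^16), it handles erasures only (no error correction), and it omits the encryption of tag IDs. The function names are hypothetical.

```python
import hashlib
import secrets

P = 65521          # largest prime below 2**16, so every symbol fits in 16 bits
K, N = 15, 20      # recovery threshold and number of tags


def lagrange_eval(points, x0):
    """Evaluate at x0 the unique polynomial of degree < len(points) through points, mod P."""
    total = 0
    for j, (xj, yj) in enumerate(points):
        num, den = 1, 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                num = num * (x0 - xm) % P
                den = den * (xj - xm) % P
        total = (total + yj * num * pow(den, -1, P)) % P
    return total


def derive_key(prekey):
    """Hash the pre-key symbols with SHA-256 and keep the first 128 bits."""
    raw = b"".join(s.to_bytes(2, "big") for s in prekey)
    return hashlib.sha256(raw).digest()[:16]


def share():
    """Return (key, shares); shares maps codeword position -> 16-bit symbol."""
    prekey = [secrets.randbelow(P) for _ in range(K)]
    systematic = list(zip(range(1, K + 1), prekey))   # positions 1..K carry the pre-key
    shares = {i: lagrange_eval(systematic, i) for i in range(1, N + 1)}
    return derive_key(prekey), shares


def recover(readings):
    """Rebuild the key from any K of the N (position, symbol) readings."""
    if len(readings) < K:
        return None                                   # recovery failure
    pts = list(readings.items())[:K]
    prekey = [lagrange_eval(pts, i) for i in range(1, K + 1)]
    return derive_key(prekey)


key, shares = share()
readable = {i: s for i, s in shares.items() if i not in (1, 4, 9, 16, 20)}
assert recover(readable) == key                       # 15 of 20 tags suffice
```

Any 15 readings determine the degree-14 polynomial and hence the pre-key, while 14 or fewer cause `recover` to report failure.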

In an ECC, a codeword consists of an ordered sequence 
of symbols. Because there is no fixed reading order for 
tags in our implementation, however, it must be order in- 
variant. That is, since shares are not distributed among 
players with fixed identities, as in our robustness exper- 
iment, we must explicitly associate an index with each 
share (effectively assigning a player index to each tag). 
Thus, the symbol on a tag must be accompanied by an 
index specifying its codeword position. Rather than speci- 
fying this index explicitly, and thereby using an additional 
16 bits of storage, we derive it implicitly based on the en- 
crypted tag ID. In particular, we hash the ID using SHA- 
256, and interpret the last 16 bits as the index; of course, 
we must do this before sharing the encryption key. This 
optimization potentially introduces a new problem: Two 
(or more) tags within a case may have ciphertexts that hash 
to the same index. A sufficiently large index size can
minimize this problem. (By the Birthday Paradox, GF(2^16)
accommodates roughly 256 tags without many collisions.)
As a further optimization, we can dedicate a few additional
bits of storage to disambiguating collisions that do occur.
Finally, if there are still too many collisions, we can simply
choose a new random pre-key κ̄ and compute a new
set of shares.
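
The implicit index derivation and the birthday estimate above can be sketched as follows. The byte order used to read the digest and the helper names are assumptions made for illustration; the text specifies only that the last 16 bits of the SHA-256 hash of the encrypted ID serve as the index.

```python
import hashlib


def codeword_index(encrypted_id: bytes, index_bits: int = 16) -> int:
    """Interpret the last index_bits bits of SHA-256(encrypted ID) as the index."""
    digest = hashlib.sha256(encrypted_id).digest()
    return int.from_bytes(digest, "big") & ((1 << index_bits) - 1)


def expected_colliding_pairs(num_tags: int, index_bits: int = 16) -> float:
    """Expected number of tag pairs whose ciphertexts hash to the same index."""
    return num_tags * (num_tags - 1) / 2 / (1 << index_bits)


def find_collisions(encrypted_ids):
    """Group encrypted IDs by index and return only the indices that collide."""
    seen = {}
    for cid in encrypted_ids:
        seen.setdefault(codeword_index(cid), []).append(cid)
    return {idx: ids for idx, ids in seen.items() if len(ids) > 1}


print(expected_colliding_pairs(256))    # about 0.5 colliding pairs for 256 tags
print(expected_colliding_pairs(1000))   # collisions become common for larger cases
```

If `find_collisions` reports too many clashes for a given case, the dealer can draw a fresh pre-key, which changes every ciphertext and therefore every index.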

In general, the first step in parameterizing the TSS
scheme for real-world usage is to determine the total number
of tags n and the key-recovery threshold k. As noted
earlier (Section 2.2), these numbers can vary widely between
use cases. Today, pallets typically carry from 1 to
200 tags each. In a typical distribution center setting, an
RFID reader could, depending on pallet composition, fail
to read as many as 2–3% (i.e., 4–6) of the tags in a 200-item
pallet, and it may pick up as many as 3–10 stray tags
from a pallet in an adjacent dock door. This means that
we can see up to 6 erasures, and up to 10 errors in reading.
These numbers are borne out by one of the authors’ (RP)
long experience in supply chain RFID deployments. Thus
the choice of a (200, 170)-Reed Solomon code (the min- 
imum distance D = N — K + 1 is typically omitted from 
Reed-Solomon parameterization), which can correct up 
to 15 errors or 30 erasures, would provide sufficient er- 
ror correction for real-world deployments. As discussed 
in Section 2.2, individual consumers typically have fewer 
than 40 tags from the same case, so we could alternatively 
choose a (200, 40)-Reed Solomon code to maintain privacy
and provide additional robustness to read errors.


Figure 4: Schematic of implementation setup. 20 TSS-encoded RFID tags, at far right, are prepared using Share as described in
the text. They are read by a ThingMagic Mercury5 reader and the encrypted IDs are passed over the network to a Matlab program
running Recover on a computer. The computer first recovers the Reed-Solomon-encoded secret key and then decrypts the tags. The
two boxes below the schematic depict what the reader sees and the eventual decrypted tag IDs. In practice, Recover would be ported
to run directly on the reader. Given the capabilities of current RFID readers, direct implementation on the reader is straightforward.
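
The Reed-Solomon parameters suggested for a 200-tag pallet can be checked with a little arithmetic. The sketch below uses D = N − K + 1 from the text together with the standard bounded-distance decoding condition 2·errors + erasures ≤ D − 1, which is a textbook property of Reed-Solomon codes rather than something stated in this paper; the function names are hypothetical.

```python
def rs_capability(n: int, k: int) -> dict:
    """Minimum distance and correction limits of an (n, k) Reed-Solomon code."""
    d = n - k + 1
    return {"distance": d, "max_errors": (d - 1) // 2, "max_erasures": d - 1}


def can_decode(n: int, k: int, errors: int, erasures: int) -> bool:
    """Standard bound: decoding succeeds when 2*errors + erasures <= D - 1."""
    return 2 * errors + erasures <= n - k


print(rs_capability(200, 170))                        # the pallet code from the text
print(can_decode(200, 170, errors=10, erasures=6))    # worst case quoted for a dock door
print(rs_capability(200, 40))                         # the privacy-oriented alternative
```

For the (200, 170) code this gives D = 31, up to 15 errors or 30 erasures, and the quoted worst case of 10 stray tags plus 6 missed reads (2·10 + 6 = 26) stays within the bound.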


Lastly, we remark on the choice of the field size. As the 
field size is the main determinant of the extra tag mem- 
ory consumed by our scheme, smaller fields mean smaller 
memory requirements. Larger field sizes reduce the num- 
ber of index collisions, which is useful both to ensure good 
decoding rates and to enforce security against an overin- 
formed adversary (Appendix A). In applications where 
only the underinformed attacker must be considered, we 
can potentially reduce the space on each tag to a single bit, 
for sufficiently large k and an appropriate ECC scheme. 


6 Secret Sharing Across Time 


Thus far, we have considered sharing schemes for one 
shipment. However, a distributor may wish to increase 
security by leveraging the fact that a legitimate recipient 
should receive more shipments than an attacker can ac- 
cess (recall Example 2 from Section 3). In this section, 
we explore a class of schemes that uses such information 
disparities across sliding time windows. In the future, we 
will investigate schemes leveraging the entropy of the en- 
tire history of interactions between a sender and recipient. 


6.1 Defining SWISS: Sliding-Window Information
Secret Sharing


In the schemes below, we assume a sender periodically
emits a share S_i. For RFID purposes, we may suppose the
sender is a manufacturer who periodically ships out
containers of RFID-labeled items. Each share may optionally
be further shared out amongst the RFID tags in the
container as described in Section 5. Each period also has an
associated key κ_i. Thus, we have a sequence of shares
S = {S_0, S_1, ...} that expands indefinitely over time. We
assume that within any window of n elements, only a
legitimate recipient receives at least k of the shares in that
window, and given those shares, the recipient should be
able to recover the corresponding keys. An adversary
receiving fewer shares should learn nothing about the keys.

Figure 5: In this example, if the adversary holds a set Ŝ of k = 3
shares (shown as shaded boxes), then we define ρ(Ŝ) as the union
of all (three) windows of n = 6 shares containing the original k
shares. We require that the adversary be unable to recover keys
for periods outside of ρ(Ŝ). The figure assumes λ = 0. If λ = 1,
then ρ(Ŝ) would include two additional shares: one before and
one after the set ρ(Ŝ) currently shown.

More formally, a SWISS scheme is defined as a pair of
algorithms Π = (Share, Recover), where:


• Share(k,n,t) is a probabilistic algorithm that takes as
  input a threshold for recoverability k, a window size
  n, and a security parameter t. It outputs two “infinite”
  vectors κ and S, where κ_i ∈ {0,1}^t is the key
  for period i, and S_i is the share for period i. On
  invalid input, Share outputs the special symbol ⊥.

• Recover is a deterministic algorithm that takes as input
  S' ⊆ W_j, where W_j defines a sequence of n shares
  starting at time j such that W_j = {S_i : j ≤ i < j+n},
  and |S'| ≥ k. The output of Recover(S') is a set of
  keys K = {κ_i : S_i ∈ S'} for the shares provided in S'
  or ⊥, a special value indicating a recovery failure.

In our security definitions, we again assume an honest
dealer, i.e., correct execution of Share. Below, we give
formal definitions for our privacy and recoverability
requirements.


Privacy. To define privacy, we require that the adversary 
cannot obtain the key for any share she does not possess. 
If the adversary holds fewer than k shares, she should not 
learn any keys. We deal with the case in which the adver- 
sary holds more than k shares as follows. 

Define the set of shares held by the adversary as Ŝ. Let
ρ(Ŝ) be the set of all shares that lie in a window of size
n + λ for which the adversary has recovered at least k shares.
We require the adversary to be unable to recover the key
for any element in ρ̄(Ŝ), the complement of ρ(Ŝ). Since
k shares allow the adversary to recover all of the keys in
a window of size n, the value of λ indicates the amount
of information k shares “leak” about keys not contained
within a window of n shares. Figure 5 illustrates these
requirements.
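
The set ρ(Ŝ) can be computed directly from its definition. In the sketch below, the function name `rho` and the concrete share positions are hypothetical; the positions are chosen so that exactly three windows of size n = 6 contain all k = 3 held shares, matching the situation described for Figure 5.

```python
def rho(held: set[int], k: int, n: int, lam: int = 0) -> set[int]:
    """Periods lying in some window of size n + lam that contains >= k held shares."""
    if not held:
        return set()
    width = n + lam
    covered = set()
    # Only windows that overlap at least one held share can qualify.
    for start in range(max(0, min(held) - width + 1), max(held) + 1):
        window = range(start, start + width)
        if sum(1 for i in held if i in window) >= k:
            covered.update(window)
    return covered


held = {5, 6, 8}                         # adversary's shares (hypothetical periods)
print(sorted(rho(held, k=3, n=6)))          # lam = 0: periods 3 through 10
print(sorted(rho(held, k=3, n=6, lam=1)))   # lam = 1: one extra period on each side
```

Keys for periods outside the returned set are the ones the privacy requirement protects; an adversary holding fewer than k shares in every window gets the empty set.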

More formally, we can define privacy based on the
following experiment:


Experiment Exp^ind-swiss_A[Π]
    (S, κ) ← Share(k, n, t);
    i ← A(“choose”);
    κ^R ←$ {0,1}^t; b ←$ {0,1};
    b' ← A^corrupt(S,·)(π(b, κ^R, κ_i), “corrupt”);
    if i ∉ ρ(Ŝ) or i ∉ Ŝ then
        output ‘1’ if b' = b, else ‘0’;
    else output ‘0’;


where π(0,x,y) = (x,y) and π(1,x,y) = (y,x). Essentially,
the adversary is asked to choose a time period
i. After corrupting some number of shares, the adversary
must distinguish between the key for period i
and a randomly selected key. We consider the adversary
successful if the period chosen does not correspond
to a share held by the adversary, or if the period
lies outside the set ρ(Ŝ) induced by the adversary’s
selection of shares. The adversary’s advantage is then


$$\mathrm{Adv}^{\mathrm{ind\text{-}swiss}}_{A}[\Pi] = 2\Pr\left[\mathrm{Exp}^{\mathrm{ind\text{-}swiss}}_{A}[\Pi] \Rightarrow 1\right] - 1.$$


Recoverability. We require that any set $S' \subseteq W_j$ with
$|S'| \ge k$ shares suffices to recover the keys associated with
each share in the set, namely $\{\kappa_i : S_i \in S'\}$. We define
recoverability for a legitimate recipient in the erasure model;
in other words, shares may be lost but not corrupted. We
can convert our SWISS schemes to a corruption model
by replacing our use of PSS schemes with robust PSS
schemes, such as Krawczyk’s [20].


Definition 2 We define a $(k,n)$-SWISS scheme as a pair
of algorithms $\Pi$ as defined above where Share produces
shares of size $\mu$. The security is characterized by the pair
$(\lambda,\epsilon)$, where (as explained above) $k$ shares are sufficient
to reveal $\lambda$ “nearby” keys for time periods not contained
in a window of $n$ shares, and $\mathbf{Adv}^{\text{ind-swiss}}_{A}[\Pi] \le \epsilon$.


Thus, an ideal SWISS scheme would have $(\lambda,\epsilon) =
(0,0)$ with minimal $\mu$.


6.2. A Family of SWISS Schemes 


In our SWISS construction, we want to ensure that the
secret for a case is only available given possession of that
case. To achieve this property, we make the key $\kappa_i$ for
case $i$ a function of both a window key and a secret value
associated with the case (or its RFID tag).

Ideally, the window key for a window of $n$ cases should
be recoverable if and only if the receiver possesses $k$ or
more cases within that window. A naive SWISS scheme
would simply generate a key for every possible window of
size $n$ and share each key using a $(k,n)$ scheme. But a case
would then need a share for every window covering it, and
hence the per-case share size would grow linearly with the
size ($n$) of each window.

Instead, we aim to bring the share size down to a small 
constant independent of k and n. We use two techniques 
for this goal. First, we allow some sloppiness in our access 
structure. Our access structure (in our main construction) 
depends on superwindows of size 2n that each overlap 
with the previous superwindow by n (see Figure 6); each 
superwindow secret is shared using a (k,2n) scheme. Ac- 
cess to a window secret requires recovery of the secrets for 
either one of its two corresponding superwindows. Any k 
shares in a sequence of size n fall into some superwindow 
of size 2n, and therefore allow recovery of the superwin- 
dow secret. The “sloppiness” is this: Recovery of shares 
in one window allows for recovery of secrets in nearby 
windows. 
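
As a rough illustration of the access structure just described, the following Python sketch checks that every window of n consecutive time periods lies inside at least one superwindow of size 2n. It assumes zero-based time periods and superwindows that start at multiples of n; the function names are hypothetical and do not come from the paper.

```python
def covering_superwindows(j, n):
    """Return the start indices s of superwindows [s, s + 2n) that fully
    contain the window [j, j + n).

    Assumption for this sketch: superwindows start at 0, n, 2n, ...
    """
    starts = []
    first = (j // n) * n  # latest superwindow start at or before j
    for s in (first - n, first):
        if s >= 0 and s <= j and j + n <= s + 2 * n:
            starts.append(s)
    return starts


def shares_share_a_superwindow(positions, n):
    """True if all held share positions lie inside one superwindow."""
    lo, hi = min(positions), max(positions)
    s = (lo // n) * n
    return hi < s + 2 * n
```

With n = 3, a window starting at period 4 is contained only in the superwindow starting at 3, while a window starting exactly at period 3 is contained in the superwindows starting at 0 and at 3.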

Given the superwindow scheme described above, we
could encrypt the secret $\kappa_i$ for each case $i$ under each of
its corresponding superwindow secrets, $\sigma$ and $\sigma'$. However,
using a second technique based on bilinear maps, we
can derive a common secret directly from either of the two
superwindow secrets $\sigma$ or $\sigma'$.

Below, we first explain the assumptions necessary for 
our schemes. Then we present our main SWISS construc- 
tion (Section 6.2.2) and show how to generalize it to a 
wider range of parameters (Section 6.2.3). 


6.2.1 Assumptions 


Our family of SWISS schemes uses bilinear pairing to re- 
duce storage costs. In the full version of this paper, we 
describe a variant of our SWISS construction based on the 
more standard RSA assumption. Unfortunately, that ver- 
sion does not generalize efficiently to large window sizes 
in the same way as does the bilinear map scheme, and 
hence we focus on the latter. 
We give some very brief preliminaries on bilinear maps,
referring the reader to [7] for details. Let $E$ be a multiplicative
cyclic group of prime order $p$ under a bilinear
operator $\hat{e}$ as defined in Boneh-Franklin [7]. Thus we have
$\hat{e} : E \times E \to E'$ for a second group $E'$. The bilinear operator
$\hat{e}$ has the property that $\forall G,H \in E$, $\hat{e}(G^a,H^b) = \hat{e}(G,H)^{ab}$;
it is also non-degenerate, meaning that if $G$ is a generator
of $E$, then $\hat{e}(G,G) \ne 1$.

Our work relies on the hardness of a slightly modified
Bilinear Diffie-Hellman Exponent (BDHE) problem [6,8].
Specifically, let $g$ and $y$ be random generators of $E$, and $\alpha$
be a random element in $\mathbb{Z}_p$. Our $(\ell,L)$-BDHE problem is
defined as:


    Given $g$, $y$, $g^{(\alpha^i)}$ for $i = 1,2,\ldots,\ell-L,\ell+1,\ldots,2\ell$
    and $y^{(\alpha^i)}$ for $i = 1,2,\ldots,L-1$,
    compute $\hat{e}(g,y)^{(\alpha^{\ell})}$.


In the original framing of the $\ell$-BDHE problem [6, 8],
only $y$ (rather than additional $\alpha$ powers of $y$) is assumed
to be known. We assume that $L \ge 2$, since the $(\ell,1)$-BDHE
problem simply degenerates to the $\ell$-BDHE problem.
Loosely speaking, the $(\ell,L)$-BDHE assumption in $E$
says that no efficient algorithm can solve the $(\ell,L)$-BDHE
problem in $E$ with non-negligible probability.
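
To see why the problem statement withholds some powers of g, the following Python sketch tracks exponents only: pairing a published power of g with exponent alpha^i against a published power of y with exponent alpha^j yields the pairing of g and y raised to alpha^(i+j). The sketch assumes the index ranges implied by the surrounding text (powers 1 to l-L and l+1 to 2l for g, powers up to L-1 for y) and checks only this single-pairing shortcut; it says nothing about more elaborate attacks. The function names are hypothetical.

```python
def given_exponents(ell, L):
    """Exponents i such that g^(alpha^i) (resp. y^(alpha^i)) is published.

    Exponent 0 stands for g and y themselves.
    """
    g_exps = [0] + list(range(1, ell - L + 1)) + list(range(ell + 1, 2 * ell + 1))
    y_exps = list(range(0, L))
    return g_exps, y_exps


def one_pairing_hits_target(g_exps, y_exps, ell):
    """True if some published pair (i, j) satisfies i + j == ell, i.e. a
    single pairing evaluation would already produce the target value."""
    return any(i + j == ell for i in g_exps for j in y_exps)
```

For example, with l = 10 and L = 3 no published pair sums to 10, but additionally publishing the power with exponent 8 (that is, l - L + 1) would let it be paired with the power of y with exponent 2 to reach the target.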

We can apply the “master” theorem of Boneh et
al. [6] to bound the difficulty of $(\ell,L)$-BDHE in a
generic group. In their terminology, we have
$P = (1, x, x^2, \ldots, x^{\ell-L}, x^{\ell+1}, \ldots, x^{2\ell}, y, xy, \ldots, x^{L-1}y)$,
$Q = (1)$, and $f = x^{\ell}y$. This implies that an attacker $A$ with
advantage 1/2 in solving the decision $(\ell,L)$-BDHE problem
in a generic bilinear group $E$ must take time at least
$\Omega(\sqrt{p/\ell} - 2\ell)$. E.g., if we assume the distributor
sends one billion windows (or fewer), then solving the
decision $(\ell,L)$-BDHE problem in a generic bilinear group $E$
of size 192 bits takes time at least $2^{80}$. Of course, a lower
bound in a generic group does not imply a lower bound in
any specific group.


6.2.2. Our Main SWISS Construction 


In Section 6.2.3, we present a fully generic overlapping
SWISS scheme, but first, to simplify the exposition, we
describe a single member of the family (see Figure 6).
This example provides a $(k,n)$-SWISS scheme with $\mu = 3t$
and security parameters $(2n - k,\epsilon)$.

Starting at time 0, the sender defines a series of superwindows
$W_0, W_n, W_{2n}, \ldots, W_{\ell n}$, each of size $2n$. Thus,
each superwindow consists of two windows of size $n$, with
one window overlapping a window from the previous superwindow,
and one window overlapping a window from
the subsequent superwindow. Each superwindow $W_{\ell n}$ defines
a $(k,2n)$ perfect secret sharing (PSS) of the superwindow
secret $\sigma_{\ell n}$. Since each time period $i$ is covered by
two superwindows, each comprising its own secret sharing
scheme, the share $S_i$ distributed in each time period
consists of two sub-shares $(s_i^{\ell}, s_i^{\ell+1})$, one for $\sigma_{\ell n}$ and
one for $\sigma_{(\ell+1)n}$. We also augment the share with a random
nonce $r_i \xleftarrow{R} \{0,1\}^t$. Thus, the share emitted during time
period $i$ is $S_i = (s_i^{\ell}, s_i^{\ell+1}, r_i)$.

Figure 6: Each superwindow of $2n$ shares (in the example
shown here, $n = 3$) overlaps with the previous superwindow by $n$
shares. Each superwindow $W_{\ell n}$ is a $(k,2n)$ sharing of the superwindow
secret $\sigma_{\ell n}$. Each time period is covered by two superwindows.
For example, the share labeled A is covered by superwindows
$W_0$ and $W_n$. As a result the key for that period $\kappa_A$ can
be recovered from either superwindow secret, $\sigma_0$ or $\sigma_n$.

Because any time period $i$ is covered by two superwindows
(say $W_{\ell n}$ and $W_{(\ell+1)n}$), we would like the key $\kappa_i$ to
be recoverable from the superwindow secret of either one
(since we do not know a priori in which superwindow the
recipient will have $k$ shares). Like many problems in computer
science, we can solve this by adding another layer of
indirection. Let $y,z \in E$, $\alpha \in \mathbb{Z}_p$, and let $(P_0, P_1) = (y,y^{\alpha})$
be a public key. Let each of the superwindow secrets be
defined so that $\sigma_{\ell n} = z^{(\alpha^{\ell})}$. We define a series of window
secrets $\omega_0, \omega_n, \ldots, \omega_{\ell n}$ so that


$\omega_{\ell n} = \hat{e}(P_1,\sigma_{\ell n}) = \hat{e}(P_0, \sigma_{(\ell+1)n}) = \hat{e}(y,z)^{(\alpha^{\ell+1})}$


That is, knowing $\sigma_{\ell n}$ allows a recipient to derive $\omega_{\ell n}$ and
$\omega_{(\ell+1)n}$.

Finally, we define each key $\kappa_i$ based on the window it
belongs to, as well as the random nonce $r_i$ distributed with
share $S_i$, as $\kappa_i = h(r_i, \omega_{\ell n})$, where $h: \{0,1\}^* \to \{0,1\}^t$ is
a hash function modeled as a random oracle [1].
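
The following toy Python sketch illustrates how the same window secret, and hence the same key, can be derived from either of two adjacent superwindow secrets. It is only an arithmetic model under loud assumptions: group elements are represented by their discrete logarithms modulo a small prime P, the "pairing" is multiplication of those logarithms, and SHA-256 truncated to 128 bits stands in for the hash h. None of this is secure or taken from the paper's implementation, and all names are hypothetical.

```python
import hashlib

P = 2**61 - 1  # toy prime group order (assumption; far too small for real use)


def pair(u, v):
    """Toy pairing on discrete logs: e(g^u, g^v) = gT^(u * v)."""
    return (u * v) % P


def superwindow_secret(z, alpha, l):
    """Discrete log of sigma_{ln} = z^(alpha^l)."""
    return (z * pow(alpha, l, P)) % P


def window_secret_from_left(y, z, alpha, l):
    """omega_{ln} computed as e(P_1, sigma_{ln}) with P_1 = y^alpha."""
    p1 = (y * alpha) % P
    return pair(p1, superwindow_secret(z, alpha, l))


def window_secret_from_right(y, z, alpha, l):
    """omega_{ln} computed as e(P_0, sigma_{(l+1)n}) with P_0 = y."""
    return pair(y, superwindow_secret(z, alpha, l + 1))


def derive_key(nonce, omega):
    """kappa_i = h(r_i, omega); SHA-256 is a stand-in for h."""
    return hashlib.sha256(nonce + omega.to_bytes(8, "big")).digest()[:16]
```

Both derivations give the value corresponding to the pairing of y and z raised to alpha^(l+1), so a recipient holding either superwindow secret obtains the same key for a given nonce.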

In the next section, we show how to generalize this construction
to decrease $\lambda$ at the cost of increasing the size
of each share. We can define an adversary for this more
general scheme as follows:

Definition 3 We define an $(\ell,L,q)$-adversary $A$ as an attacker
who achieves an $\mathbf{Adv}^{\text{ind-swiss}}_{A}[\Pi] \le \epsilon$ advantage in
our privacy experiment (defined in Section 6.1), where
$\Pi$ is an instantiation of our generic SWISS family with
$\Psi = L-1$ (for $L \ge 2$) that produces at most $2\ell$ shares.
The adversary makes at most $q$ random oracle queries.

In Appendix C, we use this definition to demonstrate the
security of the generalized scheme (and hence this specific
instantiation) by proving the following theorem:

Theorem 1 For any polynomial-time $(\ell,L,q)$-adversary
$A$ with $\mathbf{Adv}^{\text{ind-swiss}}_{A}[\Pi] = \epsilon$ and $\ell > L \ge 2$, there is a
polynomial-time adversary $A'$ that solves the $(\ell,L)$-BDHE
problem with probability $(\epsilon - 2^{-t})/(q\ell) - 1/2^{t}$.

Essentially the theorem states that given an adversary who
achieves a non-negligible advantage in our privacy experiment,
we can construct an attacker who violates the $(\ell,L)$-BDHE
assumption. We also demonstrate that this construction
satisfies our recoverability requirement.


Remark 2 As described, our SWISS construction uses a 
PSS scheme to create superwindow shares. Thus, the 
construction tolerates erasures but not errors. However, 
we could readily replace the PSS scheme with a robust 
scheme, such as our TSS scheme from Section 5, which 
would both decrease the size of the individual shares and 
add error tolerance to the SWISS construction. 


6.2.3 A Generic SWISS Family 


The above scheme can be generalized to allow decreased
values of $\lambda$ at the cost of increased storage (see Figure 7).
Specifically, for any value of $\Psi \le n$, we can create a $(k,n)$
SWISS scheme with $\mu = (\Psi + 2)t$ and security parameters
$((1 + \frac{1}{\Psi})n - k, \epsilon)$.

Essentially, we divide each superwindow $W$ into $\Psi + 1$
windows of size $n/\Psi$. The superwindows form $(k, \frac{\Psi+1}{\Psi}n)$
sharing schemes of the superwindow secrets, and each superwindow
overlaps the previous superwindow by $\Psi$ windows.
Thus, any given window is covered by $\Psi + 1$ superwindows,
and the window secret can be recovered from
any of the superwindow secrets, using the same elliptic
curve pairings technique as before. In other words, we
define a public key $(P_0, P_1, \ldots, P_{\Psi}) = (x, x^{\alpha}, \ldots, x^{(\alpha^{\Psi})})$, and a
window secret $\omega_{\ell n}$ is defined as:


$\omega_{\ell n} = \hat{e}(P_{\Psi}, \sigma_{\ell n}) = \hat{e}(P_{\Psi-1}, \sigma_{(\ell+1)n}) = \cdots = \hat{e}(P_0, \sigma_{(\ell+\Psi)n}) = \hat{e}(x,z)^{(\alpha^{\ell+\Psi})}$.


To determine $\lambda$, we consider the worst case, in which $k \le
n/\Psi$ and the adversary’s $k$ shares fall within a single window.
The window then is covered by $\Psi + 1$ superwindows,
allowing the adversary to recover secrets for $2\Psi + 1$ windows,
or $(2\Psi + 1)\frac{n}{\Psi} = 2n + \frac{n}{\Psi}$ secrets. These secrets can
be at most a superwindow ($\frac{\Psi+1}{\Psi}n$) away from the $k$ secrets
held by the adversary, so $\lambda = \frac{\Psi+1}{\Psi}n - k = (1 + \frac{1}{\Psi})n - k$. If
$k > n/\Psi$, then fewer than $\Psi + 1$ superwindows will contain
$k$ shares, and hence $\lambda$ will be even smaller.

In our example scheme from Section 6.2.2, $\Psi = 1$, so
each superwindow formed a $(k, 2n)$ secret-sharing scheme,
but we could also use $\Psi = 2$, with each superwindow consisting
of 3 windows of $n/2$ shares, and the superwindow as
a whole constituting a $(k, \frac{3}{2}n)$ sharing of the superwindow
secret (see Figure 7(a)). This would produce a smaller
value of $\lambda = \frac{3}{2}n - k$, but at the cost of larger shares: each
issued share would now contain three shares (one for each
superwindow overlapping a particular window) and the
random nonce $r_i$.
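
The trade-off described in this section can be tabulated with a few lines of Python. The sketch below is a hedged illustration: it simply evaluates the formulas stated in the text (window size n/W, superwindow size (W+1)n/W, share size (W+2)t, and worst-case leak (1+1/W)n - k, where W stands for the number of overlapping windows parameter), and the function and key names are hypothetical.

```python
from fractions import Fraction


def swiss_parameters(k, n, psi, t):
    """Evaluate the generic-family formulas for overlap parameter psi.

    The leak value is the worst-case bound from the text, which assumes
    that the adversary's k shares fit in a single window (k <= n / psi).
    """
    if not 1 <= psi <= n:
        raise ValueError("psi must satisfy 1 <= psi <= n")
    window_size = Fraction(n, psi)
    superwindow_size = (psi + 1) * window_size
    return {
        "window_size": window_size,
        "superwindow_size": superwindow_size,
        "share_bits": (psi + 2) * t,
        "leak": superwindow_size - k,
    }
```

For n = 100, k = 20, and t = 128, moving from an overlap parameter of 1 to 2 lowers the leak bound from 180 to 130 keys while raising the share size from 384 to 512 bits.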










6.2.4 Real World Instantiation 


To make our SWISS construction more concrete, we suggest
sample parameters for real world deployments. Suppose
the sender needs to ship one million or fewer shares.
We divide those shares into 10,000 windows of 100 shares
each, giving us $\ell = 5{,}000$, $n = 100$. A legitimate recipient
will receive at least $k = 20$ shares in any window. If
we use the scheme from Section 6.2.2, then $\Psi = 1$ and
$L = \Psi + 1 = 2$. Finally, if we use $t = 128$ bit keys, then
the share for each period will be $3t = 384$ bits in size. In
contrast, the naive scheme described earlier in this section
would require $nt = 12{,}800$ bits per share.


We described both our SWISS scheme and the naive
scheme using PSS as a component. If we replace the PSS
scheme with our TSS scheme from Section 5, then we
have a share size of 16 bits. In our scheme, we still need
a random nonce of at least 60 bits, but that yields shares
of size $2 \cdot 16 + 60 = 92$ bits, just small enough to fit on
an EPC tag. In contrast, the naive scheme would require
$n \cdot 16 = 1{,}600$ bits.
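
The share-size arithmetic above can be reproduced with a small Python sketch. The helper names are hypothetical; the only inputs are the figures quoted in the text (two sub-shares plus one nonce per emitted share for the main construction, and one sub-share per covering window for the naive scheme).

```python
def swiss_share_bits(sub_share_bits, nonce_bits, psi=1):
    """Bits per emitted share: psi + 1 sub-shares plus one nonce."""
    return (psi + 1) * sub_share_bits + nonce_bits


def naive_share_bits(sub_share_bits, n):
    """Bits per share when a case carries one sub-share per window of size n."""
    return n * sub_share_bits
```

Calling `swiss_share_bits(128, 128)` and `naive_share_bits(128, 100)` gives the PSS-based figures, and `swiss_share_bits(16, 60)` and `naive_share_bits(16, 100)` give the TSS-based ones.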


7 Conclusions and Future Work 


We have described two approaches to secret sharing in 
unidirectional channels: secret-sharing across space and 
secret-sharing across time. As we have shown, secret- 
sharing across space is a tool of practical promise for 
privacy protection in real-world RFID-enabled supply 
chains. Our SWISS scheme for secret-sharing across time 
can, similarly, help address the challenges of RFID tag and 
reader authentication. An open problem of particular interest
in our SWISS construction, however, is elimination
of its reliance on the non-standard $(\ell,L)$-BDHE problem in
our fully generic overlapping SWISS scheme. We also
plan to investigate extended SWISS schemes that leverage 
the entire history of interaction between a sender and re- 
ceiver, rather than simply a window of recent history. 


More broadly, we believe that a holistic view of the spe- 
cial operational requirements of RFID tags and the highly 
constrained resources of tags can give rise to important 
new cryptographic problems. Our future work will aim 
to calibrate cryptographic tools like those presented here 
to RFID supply-chain infrastructure as it evolves and its 
special operational demands come into clearer focus. 
Figure 7: Additional SWISS examples. We can create additional SWISS schemes by increasing the number of windows per
superwindow while decreasing the number of shares in each window. The number
of shares that must be held in each time period increases.

(a) A SWISS scheme with $\Psi = 2$, $n = 4$. Each superwindow
is a $(k,3n/2)$ sharing of the superwindow secret.

(b) A SWISS scheme with $\Psi = 3$, $n = 6$. Each superwindow shown is a
$(k,4n/3)$ sharing of the superwindow secret.
A Overinformed Adversaries


In the body of the paper, we discuss the notion of an un- 
derinformed adversary, one that has an insufficient set of 
shares to reconstruct a secret key. We also briefly consider
an overinformed adversary, one that possesses a set
of shares sufficient to reconstruct one or more secret keys, 
but has too many shares to feasibly determine such keys. 
We can design our system such that an adversary is over- 
informed in settings where the adversary is forced to scan 
the contents of not one, but multiple cases simultaneously. 

Consider, for example, an attacker who periodically 
scans a store shelf, hoping to accumulate enough shares 
to recover the associated key. The adversary’s reader may 
receive responses from items that arrived in multiple in- 
dependent cases. In this situation, we would like it to be 
hard for the adversary to recover any case secret from the 
full set of secrets, even if a subset of the adversary’s shares 
would suffice to reconstruct the secret. We can appeal to 
the fact that when shares from multiple cases are mixed 
together, the large set of shares can make it hard to decode 
any individual secret. 

To help render an attacker overinformed, we can deliberately
introduce “chaff” among the shares $S_i$ in a case.
Essentially, we replace $\mathcal{C}$ shares of K with randomly chosen
values. The choice of $0 < \mathcal{C} < D/2$ represents a tradeoff
between security against an overinformed attacker and the
error-tolerance of the scheme. For example, by choosing
$\mathcal{C} = D/4$, an adversary who recovers the shares from two secrets
will hold $D/2$ chaff values, potentially exceeding the
recovery threshold for the ECC scheme, as we now show.
In this situation, however, a legitimate recipient can still
tolerate $D/4$ errors in the shares she receives.


The following experiment formalizes the notion of an 
overinformed adversary. 


Experiment $\mathbf{Exp}_{A}[\Pi, X, \alpha, \beta]$
    $(x_1, \ldots, x_{\alpha}) \xleftarrow{R} X$;
    $C \leftarrow \bigcup_i C^{i}$, where $C^{i} \subseteq \mathrm{Share}(x_i)$, and $|C^{i}| = \beta$;
    $H \leftarrow \{h_i : h_i = H(x_i), 1 \le i \le \alpha\}$;
    $x' \leftarrow A^{\mathrm{corrupt}(C,\cdot)}(\text{"corrupt"})$;
    output '1' if $x' \in (x_1, \ldots, x_{\alpha})$, else '0'


In this experiment, we choose $\alpha$ random secrets. The
adversary has access to an unlabeled set of shares,
which contains $\beta$ randomly chosen shares from each
secret. The adversary also receives the hash $H$ of
each secret. Given this information, the adversary
must recover one of the original secrets. In this experiment,
we define the advantage of adversary $A$ as


$\mathbf{Adv}_{A}[\Pi, X, \alpha, \beta] \triangleq \Pr[\mathbf{Exp}_{A}[\Pi, X, \alpha, \beta] \Rightarrow 1]$.


We can characterize the overinformed adversary’s task 
in terms of the polynomial reconstruction (PR) problem, 
the decoding of a Reed-Solomon codeword in the presence 
of errors (see [19] for detailed discussion). 


Given an underlying $(N,K)$-Reed-Solomon code, and a
set of $t$ symbols, of which $\epsilon$ are corrupted, the classical
Peterson-Berlekamp-Massey (PBM) algorithm [23] successfully
decodes a set of symbols if $t - \epsilon \ge (t+K)/2$
(or, equivalently, $\epsilon \le (t-K)/2$). A more powerful decoding
scheme is that of Guruswami and Sudan (GS) [14],
which successfully decodes for $t - \epsilon > \sqrt{KN}$ in any field
of cardinality at most 2. It is conjectured that decoding 
beyond the error bound represented by GS is infeasible in 
a general sense and thus that GS offers a likely bound on 
the hardness of the PR problem. 


That said, there are different formulations of the PR 
problem and little work on the concrete hardness of the 
problem. Schemes that achieve unconditional security, 
e.g., [17] do not offer attractive parameterization ranges 
for our purposes. Choosing credible and practical hard- 
ness assumptions for an overinformed adversary in our 
scheme is an open problem. 
A.1 Parameterization of Our RFID Secret-
Sharing Scheme


We give a brief characterization of what we believe to 
be secure and feasible parameterizations of our scheme. 
These parameterizations permit PBM decoding for the le- 
gitimate reading of a single RFID-tagged case and at the 
same time exceed the GS bound for security against over- 
informed adversaries. We emphasize, however, that fur- 
ther research is needed for firm determination of the secu- 
rity of our scheme in a concrete sense. 


Suppose that a case contains $N$ tags, of which $\mathcal{C}$ are
chaff. PBM decoding for a scanned case is always possible
when the number of corruptions (or erasures) of valid
symbols $e$ is such that $N - (e + \mathcal{C}) \ge (N+K)/2$.


Example 3 Suppose that $K = 8$, $N = 200$, and $\mathcal{C} = 86$.
Then it is possible to recover the secret associated with a
case for $e \le 10$, and thus up to a 5% corruption of tag
symbols.


Suppose that an adversary reads symbols associated
with $q$ cases and attempts to recover the secret $x$ associated
with a particular case. We can establish a lower bound
on the hardness of this problem by rendering the problem
easier for the adversary. In particular, let us assume that
the adversary has access to an oracle that identifies valid
shares associated with the $q - 1$ untargeted cases (but does
not otherwise reveal which shares correspond to which
case). Then the adversary can reduce the problem of recovering
$x$ to a decoding problem with $N - \mathcal{C}$ valid shares
and $\mathcal{C}q$ chaff shares, and thus $t = N + (q-1)\mathcal{C}$ shares in
total. The GS bound implies that recovery of $x$ is hard if


$N - \mathcal{C} < \sqrt{K(N + (q-1)\mathcal{C})}$.


Example 4 Suppose that $K = 8$, $N = 200$, and $\mathcal{C} = 86$.
Then the problem of recovering a target case secret $x$ is
hard under the GS bound if $114 < \sqrt{912 + 688q}$, and thus
for $q \ge 18$.


A stronger bound is possible assuming that valid symbols,
i.e., secret-bearing data, in untargeted cases may
be regarded as chaff. This gives us a slightly unorthodox
problem distribution in which a problem instance
has $q$ embedded, secret polynomials. In this case, however,
the GS bound implies that recovery of $x$ is hard if
$N - \mathcal{C} < \sqrt{qKN}$. With an appropriate parameter choice,
we can obtain strong concrete results.


Example 5 Suppose that $K = 100$, $N = 200$, and $\mathcal{C} = 40$
(giving a 5% correction buffer in the single-case setting,
as above). Then the problem of recovering a target case
secret $x$ is hard under the GS bound if $160 < \sqrt{20000q}$,
and thus for $q \ge 2$.
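
The three worked examples in this section can be checked mechanically. The Python sketch below evaluates the stated inequalities with integer arithmetic, assuming the PBM condition N - (e + chaff) >= (N + K)/2 and the two GS-style conditions given in the text; the function names are hypothetical, and the loops assume K > 0, N > 0, and a positive chaff count so that they terminate.

```python
def pbm_max_errors(N, K, chaff):
    """Largest e with N - (e + chaff) >= (N + K) / 2, or -1 if none exists."""
    # 2 * (N - e - chaff) >= N + K  <=>  e <= (N - K - 2 * chaff) / 2
    e = (N - K - 2 * chaff) // 2
    return e if e >= 0 else -1


def min_cases_oracle_bound(N, K, chaff):
    """Smallest q with (N - chaff)^2 < K * (N + (q - 1) * chaff)."""
    q = 1
    while (N - chaff) ** 2 >= K * (N + (q - 1) * chaff):
        q += 1
    return q


def min_cases_strong_bound(N, K, chaff):
    """Smallest q with (N - chaff)^2 < q * K * N."""
    q = 1
    while (N - chaff) ** 2 >= q * K * N:
        q += 1
    return q
```

With K = 8, N = 200, and 86 chaff tags, the first function returns 10 and the second returns 18; with K = 100, N = 200, and 40 chaff tags, the first returns 10 and the third returns 2.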




B Proofs of Security for Our Tiny Secret
Sharing (TSS) Scheme


B.1 Proof of Privacy


Since many of our applications only require the distribu- 
tion of a secret key, we first define a simplified experiment 
to measure the indistinguishability of $\kappa$. Note that for this
experiment, we excise the portion of our scheme in the 
dotted box in Figure 3. Effectively, we share out a null 
secret x, and write Share() to indicate this fact. The proof 
of privacy for secrets of arbitrary size then follows in a 
straightforward manner. 


We define a key indistinguishability experiment as: 


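Experiment $\mathbf{Exp}^{\text{ind-}\kappa}_{A}[\Pi, X]$
    $(\kappa^0, S^0) \leftarrow \mathrm{Share}()$; $(\kappa^1, S^1) \leftarrow \mathrm{Share}()$;
    $b \xleftarrow{R} \{0,1\}$;
    $b' \leftarrow A^{\mathrm{corrupt}(S^b,\cdot)}(\kappa^0, \kappa^1, \text{"corrupt"})$;
    output '1' if $b = b'$, else '0'


In this experiment, the adversary receives two secret
keys generated by our sharing algorithm, as
well as the shares corresponding to one of the
keys and must determine to which key they correspond.
We define the advantage of adversary $A$ as


$\mathbf{Adv}^{\text{ind-}\kappa}_{A}[\Pi, X] = 2\Pr[\mathbf{Exp}^{\text{ind-}\kappa}_{A}[\Pi, X] \Rightarrow 1] - 1$.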


For a generic ECC, if the adversary makes at most ¢,, 
corrupt queries, then her total amount of information is 
upper-bounded by Q@. Since we model the hash function 
applied to pre-key K as a random oracle, the adversary’s 
advantage in distinguishing «° and «! is bounded above 
by Adv'"¢—* (I, X] < 1/Q*~%, Assuming an encryption 
algorithm in which key indistinguishability implies cipher- 
text indistinguishability (e.g., in an ideal cipher model), 
this bound then translates to the more general sharing of 
an arbitrary secret. Thus, we have Advi (II, X] < e, < 
1/Q*‘-™, This yields Claim 1 from Section 5.3. 


B.2 Proof of Robustness 


With a generic linear (V,K,D)-ECC, it is possible to re- 
cover a message from a codeword with fewer than D/2 er- 
rors. Thus, as long as the adversary does not corrupt D/2 
shares, €- = 0. Similarly, such a code can recover from 
D —1 erasures; and can also detect up to D — 1 errors. As 
discussed in Appendix A, we can deliberately introduce $\mathcal{C}$
chaff shares into the ECC to confound the overinformed
adversary. This would change our security parameters
such that if g, <D/2—C, then Adv/f‘|II, X] = 0 = ¢,, and 
if g¢¢ <D—1—t, then Advife-°—4et TT X] = 0. This 
yields Claim 2 from Section 5.3. 
C Proofs of Security and Recoverability for
our SWISS Scheme


We prove that our generic family of SWISS schemes from 
Section 6.2.3 meets our privacy and recoverability require- 
ments. Since our main construction from Section 6.2.2 is 
a specific instantiation (with $\Psi = 1$), its security follows
from the security of the generic family of schemes. 


C.1 Proof of Privacy


To demonstrate that our generic family of SWISS schemes
achieves our privacy requirement, we prove Theorem 1
based on the adversary specified in Def. 3. Recall that our
generic family of SWISS schemes is parameterized by $\Psi$,
one less than the number of overlapping superwindows.


Proof of Theorem 1: Suppose we are given an $(\ell,L)$-BDHE
instance comprising $y^{(\alpha^i)}$ for $i = 1,2,\ldots,L-1$ and
the sequence $U' = g^{(\alpha^i)}$ for $i = 1,2,\ldots,\ell-L,\ell+1,\ldots,2\ell$.
We construct a SWISS-scheme simulator based on an
$(\ell,L,q)$-adversary $A$ as follows.


Simulator Construction. First, we construct an appropriate
public key by letting $(P_0,P_1,\ldots,P_{L-1}) =
(y, y^{\alpha}, \ldots, y^{(\alpha^{L-1})})$. Then, we select a random $j \in \{1,\ldots,\ell\}$.
This index is our guess as to the superwindow in which 


the adversary will select a challenge key. If we let 
g=2' (0°) then U’ contains the subsequence U = 
gt gt gl” galt gt 


We use this subsequence $U$ as the set of underlying
superwindow keys in the procedure described in
Section 6.2.2, with each superwindow representing a
$(k, \frac{\Psi+1}{\Psi}n)$ sharing of $g^{(\alpha^i)}$. For the superwindows
corresponding to $g^{(\alpha^{\ell-L+1})}, \ldots, g^{(\alpha^{\ell})}$ (which are unknown), we
simply share a random value. This procedure creates a set
$S$ of shares. If $A$ queries corrupt$(S,i)$, we respond with $S_i$.

To respond to hash queries, we keep a list $V$ of previous
queries. Thus, when $A$ invokes $h(y,z)$ for the first time,
we choose a random value $v \xleftarrow{R} \{0,1\}^t$ and add $(y,z,v)$
to the internal list $V$. If $A$ has previously invoked $h$ on
$(y,z)$, then we return the corresponding value of $v$ from
$V$. This creates a perfect implementation of the random
oracle contract.
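
The bookkeeping for hash queries described above is the usual lazily sampled random oracle. A minimal Python sketch follows, assuming a 128-bit output length and using the standard `secrets` module for randomness; the class and method names are hypothetical and this is not the authors' code.

```python
import secrets


class LazyRandomOracle:
    """Answers each new query with a fresh random value and repeats the
    stored value for any query seen before."""

    def __init__(self, out_bits=128):
        self.out_bits = out_bits
        self.table = {}  # maps (y, z) to the value v returned for it

    def query(self, y, z):
        key = (y, z)
        if key not in self.table:
            self.table[key] = secrets.randbits(self.out_bits)
        return self.table[key]

    def random_logged_query(self):
        """Pick one previously logged query (y, z) uniformly at random."""
        return secrets.choice(list(self.table.keys()))
```

Keeping the full table is what later allows a simulator to select one logged query at random once the adversary terminates.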

When $A$ terminates, we ignore its output, choose a random
hash response $(y,z,v) \xleftarrow{R} V$ and return $z$.


Simulator Correctness. From the SWISS adversary’s 
point of view, the construction above accurately simulates 
the ind-swiss Experiment. Our replies to the hash queries 
perfectly instantiate a random oracle, so they offer the ad- 
versary no information with which to distinguish a real 
experiment from a simulation. Our construction deviates 
from the true protocol in one important respect: the keys 
for the superwindows corresponding to $g^{(\alpha^{\ell-L+1})}, \ldots, g^{(\alpha^{\ell})}$
are chosen at random (since we do not know the appropriate
values). However, the definition of $\rho$ precludes the
adversary from recovering these superwindow secrets, and
hence, she cannot determine that these values do not conform
to the expected structure. Nonetheless, because we
choose the superwindow secrets at random, we cannot provide
the adversary with the correct value of $\kappa_i$. In other
words, from our perspective, the value of $\kappa_i$ provided to
the adversary is a random value. At some point, the adversary
will query $h(r_i, \omega_{jn})$, but since we cannot recognize
$\omega_{jn}$, we will not know that we should return $\kappa_i$. Fortunately,
by the time the adversary makes this query, we
have already extracted the necessary information, namely
$\omega_{jn}$, so that even if the adversary quits upon determining a
discrepancy, we will still succeed.


Probability of Success. Our guess $j$ for the superwindow
from which $A$ selects a challenge key $\kappa_i$ is correct
with probability $\ge 1/\ell$. Since $h$ has a range of $\{0,1\}^t$
and $A$ has an $\epsilon$ advantage, it is clear under the random oracle
assumption on $h$ that $A$ inputs $\omega_{jn}$ with probability
$\ge \epsilon - 2^{-t}$. If $A$ has queried $h$ with $\omega_{jn}$ in the course of the
simulation, then the probability that we output the correct
$\omega_{jn} = \hat{e}(g,y)^{(\alpha^{\ell})}$ is just $1/q$.

The only other way the adversary can succeed is by recovering
a key for a share she does not hold. However,
without the share, the adversary has no knowledge of $r_i$.
The random oracle assumption on $h$ guarantees that the adversary
succeeds in guessing $\kappa_i$ with probability less than
$1/2^{t}$. Our theorem bound follows. $\square$


C.2 Proof of Recoverability 


A legitimate receiver (one who recovers at least $k$ shares
out of some window $W'$ of $n$ shares) can determine the
key corresponding to each share. Observe that given the
overlapping superwindow construction, the window $W'$ is
entirely contained within at least one superwindow $W_{\ell n}$.
Thus, $k$ elements from $W'$ suffice to reconstruct the superwindow
secret $\sigma_{\ell n}$, which can be used to calculate the
window secrets $\omega_{\ell n}, \omega_{(\ell+1)n}, \ldots, \omega_{(\ell+\Psi)n}$. Each window is
of length $n/\Psi$, and hence these $\Psi + 1$ window secrets cover
all $(\Psi + 1)n/\Psi$ elements in superwindow $W_{\ell n}$. Using the
random nonce $r_i$ in each share $S_i$, the legitimate receiver
can calculate $\kappa_i$ by hashing $r_i$ with the appropriate window
secret.
Abstract

Antivirus software is one of the most widely used tools 
for detecting and stopping malicious and unwanted files. 
However, the long term effectiveness of traditional host- 
based antivirus is questionable. Antivirus software fails 
to detect many modern threats and its increasing com- 
plexity has resulted in vulnerabilities that are being ex- 
ploited by malware. This paper advocates a new model 
for malware detection on end hosts based on providing 
antivirus as an in-cloud network service. This model en- 
ables identification of malicious and unwanted software 
by multiple, heterogeneous detection engines in paral- 
lel, a technique we term ‘N-version protection’. This 
approach provides several important benefits including 
better detection of malicious software, enhanced foren- 
sics capabilities, retrospective detection, and improved 
deployability and management. To explore this idea we 
construct and deploy a production quality in-cloud an- 
tivirus system called CloudAV. CloudAV includes a 
lightweight, cross-platform host agent and a network ser- 
vice with ten antivirus engines and two behavioral detec- 
tion engines. We evaluate the performance, scalability, 
and efficacy of the system using data from a real-world 
deployment lasting more than six months and a database 
of 7220 malware samples covering a one year period. 
Using this dataset we find that CloudAV provides 35% 
better detection coverage against recent threats compared 
to a single antivirus engine and a 98% detection rate 
across the full dataset. We show that the average length 
of time to detect new threats by an antivirus engine is 48 
days and that retrospective detection can greatly mini- 
mize the impact of this delay. Finally, we relate two case 
studies demonstrating how the forensics capabilities of 
CloudAV were used by operators during the deployment. 


1 Introduction 


Detecting malicious software is a complex problem. The 
vast, ever-increasing ecosystem of malicious software 
and tools presents a daunting challenge for network operators
and IT administrators. Antivirus software is one
of the most widely used tools for detecting and stopping
malicious and unwanted software. However, the increasing
sophistication of modern malicious software means
that it is increasingly challenging for any single vendor to 
develop signatures for every new threat. Indeed, a recent 
Microsoft survey found more than 45,000 new variants 
of backdoors, trojans, and bots during the second half of 
2006 [17]. 

Two important trends call into question the long term 
effectiveness of products from a single antivirus vendor. 
First, there is a significant vulnerability window between 
when a threat first appears and when antivirus vendors 
generate a signature. Moreover, a substantial percentage 
of malware is never detected by antivirus software. This 
means that end systems with the latest antivirus software 
and signatures can still be vulnerable for long periods of 
time. The second important trend is that the increasing 
complexity of antivirus software and services has indirectly
resulted in vulnerabilities that can be and are being
exploited by malware. That is, malware is actually us- 
ing vulnerabilities in antivirus software itself as a means 
to infect systems. SANS has listed vulnerabilities in an- 
tivirus software as one of the top 20 threats of 2007 [27]. 

In this paper we suggest a new model for the detec- 
tion functionality currently performed by host-based an- 
tivirus software. This shift is characterized by two key 
changes. 


1. Antivirus as a network service: First, the detec- 
tion capabilities currently provided by host-based 
antivirus software can be more efficiently and ef- 
fectively provided as an in-cloud network service. 
Instead of running complex analysis software on ev- 
ery end host, we suggest that each end host run a 
lightweight process to detect new files, send them to 
a network service for analysis, and then permit ac- 
cess or quarantine them based on a report returned 
by the network service.


2. N-version protection: Second, the identification of 
malicious and unwanted software should be deter- 
mined by multiple, heterogeneous detection engines 
in parallel. Similar to the idea of N-version pro- 
gramming, we propose the notion of N-version pro- 
tection and suggest that malware detection systems 
should leverage the detection capabilities of multi- 
ple, heterogeneous detection engines to more effec- 
tively determine malicious and unwanted files. 


This new model provides several important benefits. 
(1) Better detection of malicious software: antivirus en- 
gines have complementary detection capabilities and a 
combination of many different engines can improve the 
overall identification of malicious and unwanted soft- 
ware. (2) Enhanced forensics capabilities: information 
about which hosts accessed which files provides an incredibly
rich database of information for forensics and intrusion
analysis. Such information provides temporal relationships
between file access events on the same or different
hosts. (3) Retrospective detection: when a new
threat is identified, historical information can be used
to identify exactly which hosts or users opened similar or
identical files. For example, if a new botnet is detected, 
an in-cloud antivirus service can use the execution his- 
tory of hosts on a network to identify which hosts have 
been infected and notify administrators or even automat- 
ically quarantine infected hosts. (4) Improved deploya- 
bility and management: Moving detection off the host 
and into the network significantly simplifies host soft- 
ware enabling deployment on a wider range of platforms 
and enabling administrators to centrally control signa- 
tures and enforce file access policies. 

To explore and validate this new antivirus model, we 
propose an in-cloud antivirus architecture that consists 
of three major components: a lightweight host agent run 
on end hosts like desktops, laptops, and mobile devices
that identifies new files and sends them into the network 
for analysis; a network service that receives files from 
hosts and identifies malicious or unwanted content; and 
an archival and forensics service that stores information 
about analyzed files and provides a management inter- 
face for operators. 

We construct, deploy, and evaluate a production qual- 
ity in-cloud antivirus system called CloudAV. CloudAV 
includes a lightweight, cross-platform host agent for 
Windows, Linux, and FreeBSD and a network service 
consisting of ten antivirus engines and two behavioral 
detection engines. We provide a detailed evaluation of 
the system using a dataset of 7220 malware samples col- 
lected in the wild over a period of a year [20] and a pro- 
duction deployment of our system on a campus network 
in computer labs spanning multiple departments for a period
of over 6 months.

Using the malware dataset, we show how the Clou- 
dAV N-version protection approach provides 35% better 
detection coverage against recent threats compared to a 
single antivirus engine and 98% detection coverage of 
the entire dataset compared to 83% with a single engine. 
In addition, we empirically find that the average length of 
time to detect new threats by a single engine is 48 days 
and show how retrospective detection can greatly mini- 
mize the impact of this delay. 

Finally, we analyze the performance and scalability of 
the system using deployment results and show that while 
the total number of executables run by all the systems in 
a computing lab is quite large (an average of 20,500 per 
day), the number of unique executables run per day is 
two orders of magnitude smaller (an average of 217 per 
day). This means that the caching mechanisms employed
in the network service achieve a hit rate of over 99.8%,
reducing the load on the network. In the rare case
of a cache miss, we show that the average time required
to analyze a file using CloudAV’s detection engines is
approximately 1.3 seconds.


2 Limitations of Antivirus Software 


Antivirus software is one of the most successful and 
widely used tools for detecting and stopping malicious 
and unwanted software. Antivirus software is deployed 
on most desktops and workstations in enterprises across 
the world. The market for antivirus and other security 
software is estimated to increase to over $10 billion
in 2008 [10].

The ubiquitous deployment of antivirus software is 
closely tied to the ever-expanding ecosystem of mali- 
cious software and tools. As the construction of mali- 
cious software has shifted from the work of novices to a 
commercial and financially lucrative enterprise, antivirus 
vendors must expend more resources to keep up. The 
rise of botnets and targeted malware attacks for the purposes
of spam, fraud, and identity theft presents an evolving
challenge for antivirus companies. For example, the
recent Storm worm demonstrated the use of encrypted 
peer-to-peer command and control, and the rapid deploy- 
ment of new variants to continually evade the signatures 
of antivirus software [4]. 

However, two important trends call into question the 
long term effectiveness of products from a single an- 
tivirus vendor. The first is that antivirus software fails 
to detect a significant percentage of malware in the wild. 
Moreover, there is a significant vulnerability window between
when a threat first appears and when antivirus vendors
generate a signature or modify their software to detect
the threat. This means that end systems with the
latest antivirus software and signatures can still be vulnerable
for long periods of time. The second important
trend is that the increasing complexity of antivirus software
and services has indirectly resulted in vulnerabilities
that can be, and are being, exploited by malware. That
is, malware is actually using vulnerabilities in antivirus
software as a means to infect systems.

AV Vendor     Version     3 Months   1 Month   1 Week
Avast         4.7.1043    62.7%      45.8%     39.6%
AVG           7.5.503     83.8%      78.6%     72.2%
BitDefender   71.2559     83.9%      79.7%     78.5%
ClamAV        0.91.2      57.5%      48.8%     46.8%
CWSandbox     2.0         N/A        N/A       N/A
F-Prot        6.0.8.0     70.4%      49.6%     46.0%
F-Secure      8.00.101    80.9%      74.4%     60.3%
Kaspersky     7.0.0.125   89.2%      84.0%     78.5%
McAfee        8.5.01      70.5%      56.7%     53.9%
Norman        1.8         N/A        N/A       N/A
Symantec      15.0.0.58   60.8%      38.8%     45.2%
Trend Micro   16.00       79.4%      74.6%     75.3%
(a)

Figure 1: Detection rate for ten popular antivirus products as a function of the age of the malware samples.


2.1 Vulnerability Window 


The sheer volume of new threats means that it is diffi- 
cult for any one antivirus vendor to create signatures for 
all new threats. The ability of any single vendor to cre- 
ate signatures is dependent on many factors such as de- 
tection algorithms, collection methodology of malware 
samples, and response time to 0-day malware. The end 
result is that there is a significant period of time between 
when a threat appears and when a signature is created by 
antivirus vendors (the vulnerability window). 

To quantify the vulnerability window, we analyzed the 
detection rate of multiple antivirus engines across malware
samples collected over a one-year period. The
dataset included 7220 samples that were collected between
November 11th, 2006 and November 10th, 2007.
The malware dataset is described in further detail in Section 6.
The signatures used by the antivirus engines were updated
the day after collection ended, November 11th, 2007, and
stayed constant through the analysis.

In the first experiment, we analyzed the detection of 
recent malware. We created three groups of malware: 
one that included malware collected more recently than 
3 months ago, one that included malware collected more 
recently than 1 month ago, and one that included mal- 
ware collected more recently than 1 week ago. The an- 
tivirus engine and signature versions along with their as- 
sociated detection rates for each time period are listed 
in Figure 1(a). The table clearly shows that the detection
rate decreases as the malware becomes more recent.
Specifically, the number of malware samples detected in 
the 1 week time period, arguably the most recent and im- 
portant threats, is quite low. 
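
The grouping used in this first experiment can be sketched as follows. This is a minimal illustration under stated assumptions, not the scripts used for the paper: the `Sample` record, the `detection_rate` helper, the day counts chosen for each window, and the toy data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Sample:
    """A malware sample with its collection date and one engine's verdict (hypothetical record)."""
    collected: date
    detected: bool


def detection_rate(samples, end, window_days):
    """Fraction of samples collected within `window_days` before `end` that were detected."""
    cutoff = end - timedelta(days=window_days)
    recent = [s for s in samples if cutoff < s.collected <= end]
    if not recent:
        return None  # no samples fall inside this window
    return sum(s.detected for s in recent) / len(recent)


# Toy data, invented purely for illustration.
end = date(2007, 11, 10)
samples = [
    Sample(date(2007, 11, 8), False),
    Sample(date(2007, 11, 5), True),
    Sample(date(2007, 10, 20), True),
    Sample(date(2007, 9, 1), True),
]
for label, days in [("3 months", 90), ("1 month", 30), ("1 week", 7)]:
    print(label, detection_rate(samples, end, days))
```

Each window is a subset of the previous one, so a falling rate from the 3 month group to the 1 week group reflects weaker coverage of the newest samples.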

In the second experiment, we extended this analysis 
across all the days in the year over which the malware 
samples were collected. Figure 1(b) shows significant 
degradation of antivirus engine detection rates as the age, 
or recency, of the malware sample is varied. As can 
be seen in the figure, detection rates can drop over 45% 
when one day’s worth of malware is compared to a year’s 
worth. As the plot shows, antivirus engines tend to be ef- 
fective against malware that is a year old but much less 
useful in detecting more recent malware, which poses the
greatest threat to end hosts. 


2.2 Antivirus Software Vulnerabilities 


A second major concern about the long term viability 
of host-based antivirus software is that the complexity 
of antivirus software has resulted in an increased risk 
of security vulnerabilities. Indeed, severe vulnerabil- 
ities have been discovered in the antivirus engines of 
nearly every vendor. While local exploits are more common
(ioctl vulnerabilities, overflows in decompression
routines, etc.), remote exploits in management interfaces
have been observed in the wild [30]. Due to the inherent 
need for elevated privileges by antivirus software, many 
of these vulnerabilities result in a complete compromise 
of the affected end host. 

Figure 2 shows the number of vulnerabilities reported 
in the National Vulnerability Database [21] for ten popu- 
lar antivirus vendors between 2005 and November 2007. 
This large number of reported vulnerabilities demonstrates
not only the risk involved in deploying antivirus
software, but also an evolution in tactics as attackers are
now targeting vulnerabilities in antivirus software itself.

Figure 2: Number of vulnerabilities reported in the National
Vulnerability Database (NVD) for ten antivirus
vendors between 2005 and November 2007.


3 Approach 


This paper advocates a new model for the detection 
functionality currently performed by antivirus software. 
First, the detection capabilities currently provided by 
host-based antivirus software can be more efficiently and 
effectively provided as an in-cloud network service. Sec- 
ond, the identification of malicious and unwanted soft- 
ware should be determined by multiple, heterogeneous 
detection engines in parallel. 


3.1 Deployment Environment 


Before getting into details of the approach, it is impor- 
tant to understand the environment in which such an ar- 
chitecture is most effective. First and foremost, we do 
not see the architecture replacing existing antivirus or in- 
trusion detection solutions. We base our approach on the 
same threat model as existing host-based antivirus solu- 
tions and assume an in-cloud antivirus service would run 
as an additional layer of protection to augment existing 
security systems such as those inside an organizational 
network like an enterprise. Some possible deployment 
environments include: 


• Enterprise networks: Enterprise networks tend to
be highly controlled environments in which IT ad- 
ministrators control both desktop and server soft- 
ware. In addition, enterprises typically have good 
network connectivity with low latencies and high 
bandwidth between workstations and back office 
systems. 


• Government networks: Like enterprise networks,
government networks tend to be highly controlled
with strictly enforced software and security practices.
In addition, policy enforcement, access control,
and forensic logging can be useful in tracking
sensitive information.


• Mobile/Cellular networks: The rise of ubiquitous
WiFi and mobile 2.5G and 3G data networks
also provides an excellent platform for a provider-managed
antivirus solution. As mobile devices become
increasingly complex, there is an increasing
need for mobile security software. Antivirus soft- 
ware has recently become available from multiple 
vendors for mobile phones [9, 13, 31]. 


Privacy implications: Shifting file analysis to a central 
location provides significant benefits but also has impor- 
tant privacy implications. It is critical that users of an in- 
cloud antivirus solution understand that their files may 
be transferred to another computer for analysis. There 
may be situations where this might not be acceptable
to users (e.g. many law firms and many consumer broad- 
band customers). However, in controlled environments 
with explicit network access policies, like many enter- 
prises, such issues are less of a concern. Moreover, the 
amount of information that is collected can be carefully 
controlled depending on the environment. As we will 
discuss later, information about each file analyzed and 
what files are cached can be controlled depending on the 
policies of the network. 


3.2 In-Cloud Detection


The core of the proposed approach is moving the detec- 
tion of malicious and unwanted files from end hosts and 
into the network. This idea was originally introduced in 
[23] and we significantly extend and evaluate the concept 
in this paper. 

There is currently a strong trend toward moving services
from end hosts and monolithic servers into the network
cloud. For example, in-cloud email [5, 7, 28] and
HTTP [18, 25] filtering systems are already popular and 
are used to provide an additional layer of security for 
enterprise networks. In addition, there have been sev- 
eral attempts to provide network services as overlay net- 
works [29, 33]. 

Moving the detection of malicious and unwanted files 
into the network significantly lowers the complexity of 
host-based monitoring software. Clients no longer need 
to continually update their local signature database, re- 
ducing administrative cost. Simplifying the host soft- 
ware also decreases the chance that it could contain ex- 
ploitable vulnerabilities [15, 30]. Finally, a lightweight 
host agent allows the service to be extended to mobile 
and resource-limited devices that lack sufficient process- 
ing power but remain an enticing target for malware. 
3.3 N-Version Protection


The second core component of the proposed approach 
is a set of heterogeneous detection engines that are used 
to provide analysis results on a file, also known as N- 
version protection. This approach is very similar to N- 
version programming, a paradigm in which multiple im- 
plementations of critical software are written by inde- 
pendent parties to increase the reliability of software by 
reducing the probability of concurrent failures [2]. Tra- 
ditionally, N-version programming has been applied to 
systems requiring high availability such as distributed 
filesystems [26]. N-version programming has also been 
applied to the security realm to detect implementation faults
in web services that may be exploited by an attacker [19]. 
While N-version programming uses multiple implemen- 
tations to increase fault tolerance in complex software, 
the proposed approach uses multiple independent im- 
plementations of detection engines to increase coverage 
against a highly complex and ever-evolving ecosystem of 
malicious software. 

A few online services have recently been constructed 
that implement N-version detection techniques. For ex- 
ample, there are online web services for malware sub- 
mission and analysis [6, 11, 22]. However, these services 
are designed for the occasional manual upload of a virus 
sample, rather than the automated and real-time protec- 
tion of end hosts. 


4 Architecture 


In order to move the detection of malicious and unwanted 
files from end hosts and into the network, several impor- 
tant challenges must be overcome: (1) unlike existing 
antivirus software, files must be transported into the network
for analysis; (2) an efficient analysis system must
be constructed to handle the analysis of files from many 
different hosts using many different detection engines in 
parallel; and (3) the performance of the system must be 
similar to or better than existing detection systems such as
antivirus software. 

To address these problems we envision an architec- 
ture that includes three major components. The first is a 
lightweight host agent run on end systems like desktops, 
laptops, and mobile devices that identifies new files and
sends them into the network for analysis. The second is 
a network service that receives files from the host agent, 
identifies malicious and unwanted content, and instructs 
hosts whether access to the files is safe. The third com- 
ponent is an archival and forensics service that stores in- 
formation about what files were analyzed and provides 
a query and alerting interface for operators. Figure 3 
shows the high level architecture of the approach. 
4.1 Client Software


Malicious and unwanted files can enter an organization 
from many sources. For example, mobile devices, USB 
drives, email attachments, downloads, and vulnerable 
network services are all common entry points. Due to 
the broad range of entry vectors, the proposed architec- 
ture uses a lightweight file acquisition agent run on each 
end system. 

Just like existing antivirus software, the host agent 
runs on each end host and inspects each file on the sys- 
tem. Access to each file is trapped and diverted to a han- 
dling routine which begins by generating a unique identi- 
fier (UID) of the file and comparing that identifier against 
a cache of previously analyzed files. If a file UID is not 
present in the cache then the file is sent to the in-cloud 
network service for analysis. 
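
The access-handling routine described above can be sketched as follows. This is a simplified illustration, not the CloudAV host agent: the `HostAgent` class, the `fake_service` stand-in for the network service, and the choice of SHA-256 for the UID are assumptions made for the example.

```python
import hashlib


def file_uid(data: bytes) -> str:
    """UID of a file's contents; SHA-256 is an illustrative choice."""
    return hashlib.sha256(data).hexdigest()


class HostAgent:
    """Hypothetical host agent: checks a local cache, else asks the network service."""

    def __init__(self, analyze_remote):
        self.cache = {}                       # UID -> True if the file is safe
        self.analyze_remote = analyze_remote  # callable standing in for the network service

    def on_access(self, data: bytes) -> bool:
        """Return True if access to the file should be permitted."""
        uid = file_uid(data)
        if uid not in self.cache:
            # Cache miss: send the file for analysis and remember the verdict.
            self.cache[uid] = self.analyze_remote(data)
        return self.cache[uid]


calls = []


def fake_service(data: bytes) -> bool:
    """Stand-in for the in-cloud service: treats files containing a marker as unsafe."""
    calls.append(data)
    return b"EVIL" not in data


agent = HostAgent(fake_service)
print(agent.on_access(b"hello"))       # analyzed remotely
print(agent.on_access(b"hello"))       # answered from the local cache
print(agent.on_access(b"EVIL bytes"))  # analyzed remotely and denied
print(len(calls), "files sent to the service")
```

Only the first access to a given file content reaches the service; later accesses are resolved from the cache by UID.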

To make the analysis process more efficient, the archi- 
tecture provides a method for sending a file for analysis 
as soon as it is written on the end host’s filesystem (e.g., 
via file-copy, installation, or download). Doing so amor- 
tizes the transmission and analysis cost over the time 
elapsed between file creation and system or user-initiated 
access. 


4.1.1 Threat Model 


The threat model for the host agent is similar to that 
of existing software protection mechanisms such as an- 
tivirus, host-based firewalls, and host-based intrusion de- 
tection. As with these host-based systems, if an attacker 
has already achieved code execution privileges, it may be 
possible to evade or disable the host agent. As described 
in Section 2, antivirus software contains many vulnera- 
bilities that can be directly targeted by malware due to 
its complexity. By reducing the complexity of the host 
agent by moving detection into the network, it is possi- 
ble to reduce the vulnerability footprint of host software 
that may lead to elevated privileges or code execution. 


4.1.2 File Unique Identifiers 


One of the core components of the host agent is the file 
unique identifier (UID) generator. The goal of the UID 
generator is to provide a compact summary of a file. That 
summary is transmitted over the network to determine if 
an identical file has already been analyzed by the net- 
work service. One of the simplest methods of generat- 
ing such a UID is a cryptographic hash of a file, such as 
MD5 or SHA-1. Cryptographic hashes are fast and pro- 
vide excellent resistance to collision attacks. However, 
the same collision resistance also means that changing a 
single byte in a file results in a completely different UID.
To combat polymorphic threats, a more complex UID
generator algorithm could be employed. For example,
a method such as locality-preserving hashing in multidimensional
spaces [12] could be used to track differences
between two files in a compact manner.

Figure 3: Architectural approach for in-cloud file analysis service.
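
The hash-based UID and its sensitivity to a single changed byte can be illustrated with Python's standard `hashlib` module. The `file_uid` helper and the sample bytes are hypothetical; the text names MD5 and SHA-1, and the SHA-256 default used here is an assumption of this sketch that can be changed through the `algorithm` parameter.

```python
import hashlib


def file_uid(data: bytes, algorithm: str = "sha256") -> str:
    """Return a hex digest of the file contents to use as its UID."""
    return hashlib.new(algorithm, data).hexdigest()


original = b"MZ" + bytes(64)         # stand-in for an executable's bytes
modified = original[:-1] + b"\x01"   # same content with the last byte changed

uid_a = file_uid(original)
uid_b = file_uid(modified)
differing = sum(a != b for a, b in zip(uid_a, uid_b))
print(uid_a)
print(uid_b)
print(f"{differing} of {len(uid_a)} hex digits differ")
```

Because the two digests are unrelated, a cache keyed on such a UID treats the modified file as entirely new, which is the limitation that motivates the locality-preserving alternative mentioned above.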


4.1.3 User Interface 


We envision three major modes of operation that affect
how users interact with the host agent that range from 
less to more interactive. 


• Transparent mode: In this mode, the detection
software is completely transparent to the end user. 
Files are sent into the cloud for analysis but the ex- 
ecution or loading of a file is never blocked or inter- 
rupted. In this mode end hosts can become infected 
by known malware but administrators can use de- 
tection alerts and detailed forensic information to 
aid in cleaning up infected systems. 


• Warning mode: In this mode, access to a file is
blocked until an access directive has been returned
to the host agent. If the file is classified as unsafe
then a warning is presented to the user explaining
why the file is suspicious. The user is then
allowed to make the decision of whether to proceed 
in accessing the file or not. 


• Blocking mode: In this mode, access to a file is
blocked until an access directive has been returned 
to the host agent. If the file is classified as suspi- 
cious then access to the file is denied and the user is 
informed with an error dialog. 
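
The three modes differ only in what happens once a verdict is available, which can be summarized in a small decision function. The `Mode` enum and `allow_access` function below are hypothetical names used for illustration, and the sketch assumes the access directive has already arrived.

```python
from enum import Enum


class Mode(Enum):
    TRANSPARENT = "transparent"
    WARNING = "warning"
    BLOCKING = "blocking"


def allow_access(mode: Mode, file_is_unsafe: bool, user_accepts_risk: bool = False) -> bool:
    """Decide whether access proceeds once the access directive has arrived."""
    if mode is Mode.TRANSPARENT:
        return True               # never blocked; alerts go to administrators instead
    if not file_is_unsafe:
        return True               # safe files are allowed in every mode
    if mode is Mode.WARNING:
        return user_accepts_risk  # the user makes the final decision
    return False                  # blocking mode denies unsafe files


for mode in Mode:
    print(mode.value, allow_access(mode, file_is_unsafe=True))
```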


4.1.4 Other File Acquisition Methods 


While the host agent is the primary method of acquiring 
candidate files and transmitting them to the network ser- 
vice for analysis, other methods can also be employed 
to increase the performance and visibility of the system. 
For example, a network sensor or tap monitoring the traf- 
fic of a network may pull files directly out of a network 
stream using deep packet inspection (DPI) techniques.
By identifying files and performing analysis before the 
file even reaches the destination host, the need to retrans- 
mit the file to the network service is alleviated and user- 
perceived latencies can be reduced. Clearly this approach 
cannot completely replace the host agent as network traf- 
fic can be encrypted, files may be encapsulated in un- 
known protocols, and the network is only one source of 
malicious content. 


4.2 Network Service 


The second major component of the architecture is the 
network service responsible for file analysis. The core 
task of the network service is to determine whether a file 
is malicious or unwanted. Unlike existing systems, each 
file is analyzed by a collection of detection engines. That 
is, each file is analyzed by multiple detection engines in 
parallel and a final determination of whether a file is ma- 
licious or unwanted is made by aggregating these indi- 
vidual results into a threat report. 


4.2.1 Detection Engines 


A cluster of servers can quickly analyze files using mul- 
tiple detection techniques. Additional detection engines 
can easily be integrated into a network service, allow- 
ing for considerable extensibility. Such comprehensive 
analysis can significantly increase the detection cover- 
age of malicious software. In addition, the use of en- 
gines from different vendors using different detection 
techniques means that the overall result does not rely too 
heavily on a single vendor or detection technology. 

A wide range of both lightweight and heavyweight de- 
tection techniques can be used in the backend. For exam- 
ple, lightweight detection systems like existing antivirus 
engines can be used to evaluate candidate files. In addi- 
tion, more heavyweight detectors like behavioral analyz- 
ers can also be used. A behavioral system executes a sus- 
picious file in a sandboxed environment (e.g., Norman 
Sandbox [22], CWSandbox [6]) or virtual machine and
records host state changes and network activity. Such 
deep analysis is difficult or impossible to accomplish on 
resource-constrained devices like mobile phones but is 
possible when detection is moved to dedicated servers. 
In addition, instead of forcing signature updates to every 
host, detection engines can be kept up-to-date with the 
latest vendor signatures at a central source. 

Finally, running multiple detection engines within the 
same service provides the ability to correlate information
between engines. For example, if a detector finds
that the behavior of an unknown file is similar to that of
a file previously classified as malicious by antivirus engines,
the unknown file can be marked as suspicious [23].


4.2.2 Result Aggregation 


The results from the different detection engines must be 
combined to determine whether a file is safe to open, ac- 
cess, or execute. Several variables may impact this pro- 
cess. 

First, results from the detection engines may reach the 
aggregator at different times — if a detector fails, it may 
never return any results. In order to prevent a slow or 
failed detector from holding up a host, the aggregator can 
use a subset of results to determine if a file is safe. Deter- 
mining the size of such a quorum depends on the deploy- 
ment scenario and variables like the number of detection 
engines, security policies, and latency requirements. 
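
One way to realize such a quorum is to run the engines in parallel and stop waiting once enough verdicts have arrived or a deadline passes. The sketch below uses Python's standard `concurrent.futures` module; the `gather_quorum` function, the engine names, and the timing values are assumptions for illustration and do not describe the CloudAV aggregator.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def gather_quorum(engines, data, quorum, timeout):
    """Run every engine in parallel; return once `quorum` verdicts arrive or `timeout` expires.

    `engines` maps an engine name to a callable returning True when the file is flagged.
    Engines that raise an exception are treated as failed and contribute no verdict.
    """
    pool = ThreadPoolExecutor(max_workers=len(engines))
    pending = {pool.submit(scan, data): name for name, scan in engines.items()}
    verdicts = {}
    deadline = time.monotonic() + timeout
    while pending and len(verdicts) < quorum:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        done, _ = wait(list(pending), timeout=remaining, return_when=FIRST_COMPLETED)
        for future in done:
            name = pending.pop(future)
            if future.exception() is None:
                verdicts[name] = future.result()
    pool.shutdown(wait=False)  # do not block on engines that are still running
    return verdicts


def slow_engine(data):
    time.sleep(0.5)  # simulates a detector that responds too late
    return True


def broken_engine(data):
    raise RuntimeError("engine crashed")


engines = {
    "engine_a": lambda data: False,
    "engine_b": lambda data: b"EVIL" in data,
    "slow": slow_engine,
    "broken": broken_engine,
}
verdicts = gather_quorum(engines, b"EVIL payload", quorum=2, timeout=2.0)
print(verdicts)
```

With a quorum of two, the host receives an answer as soon as the two fast engines report; the slow and failed engines do not delay it.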

Second, the metadata returned by each detector may 
be different so the detection results are wrapped in a con- 
tainer object that describes how the data should be inter- 
preted. For example, behavioral analysis reports may not 
indicate whether a file is safe but can be attached to the 
final aggregation report to help users, operators, or exter- 
nal programs interpret the results. 

Lastly, the threshold at which a candidate file is
deemed unsafe or malicious may be defined by the security
policy of the network’s administrators. For example,
some administrators may opt for a strict policy where a 
single engine is sufficient to deem a file malicious while 
less security-conscious administrators may require mul- 
tiple engines to agree to deem a file malicious. We dis- 
cuss the balance between coverage and confidence fur- 
ther in Section 7. 
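
The threshold policy amounts to counting how many engines flag a file. A minimal sketch, with a hypothetical `is_malicious` function and invented engine names:

```python
def is_malicious(verdicts, threshold=1):
    """Flag a file when at least `threshold` engines report it as malicious.

    `verdicts` maps an engine name to True (flagged) or False (not flagged).
    """
    return sum(1 for flagged in verdicts.values() if flagged) >= threshold


verdicts = {"engine_a": True, "engine_b": False, "engine_c": False}
print(is_malicious(verdicts, threshold=1))  # strict policy: one engine suffices
print(is_malicious(verdicts, threshold=2))  # requires agreement between engines
```

A threshold of one maximizes coverage, while a higher threshold trades some coverage for greater confidence in each positive result.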

The result of the aggregation process is a threat report 
that is sent to the host agent and can be cached on the 
server. A threat report can contain a variety of metadata 
and analysis results about a file. The specific contents 
of the report depend on the deployment scenario. Some 
possible report sections include: (1) an operation directive:
a set of instructions indicating the action to be performed
by the host agent, such as how the file should
be accessed, opened, executed, or quarantined; (2) family/variant
labels: a list of malware family/variant classification
labels assigned to the file by the different detection
engines; and (3) behavioral analysis: a list of host
and network behaviors observed during simulation. This
may include information about processes spawned, files 
and registry keys modified, network activity, or other 
state changes. 
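
A threat report with these three sections could be represented as a small record that serializes to JSON. The field names, directive values, labels, and behaviors below are invented for illustration; the text does not specify the report format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ThreatReport:
    """Hypothetical threat report with the three sections listed above."""
    uid: str
    directive: str                                 # e.g. "allow", "warn", "quarantine"
    labels: dict = field(default_factory=dict)     # engine name -> family/variant label
    behaviors: list = field(default_factory=list)  # observed host and network behaviors

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, sort_keys=True)


report = ThreatReport(
    uid="uid-1234",
    directive="quarantine",
    labels={"engine_a": "ExampleFamily.A", "engine_b": "ExampleFamily.gen"},
    behaviors=["spawned process: example.exe", "opened TCP connection to port 25"],
)
print(report.to_json())
```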


4.2.3 Caching 


Once a threat report has been generated for a candidate 
file, it can be stored in both a local cache on the host 
agent and in a shared remote cache on the server. This 
means that once a file has been analyzed, subsequent ac- 
cesses to that file by the user can be determined locally 
without requiring network access. Moreover, once a sin- 
gle host in a network has accessed a file and sent it to 
the network service for analysis, any subsequent access 
of the same file by other hosts in the network can lever- 
age the existing threat report in the shared remote cache 
on the server. Cached reports stored in the network ser- 
vice may also periodically be pushed to the host agent to 
speed up future accesses and invalidated when deemed 
necessary. 
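
The two cache levels can be sketched as a lookup that tries the host's local cache, then the shared cache, and only then triggers analysis. The `CachedLookup` class is hypothetical, and a plain dictionary stands in for the shared remote cache on the server.

```python
class CachedLookup:
    """Hypothetical per-host lookup backed by a local cache and a shared remote cache."""

    def __init__(self, shared_cache, analyze):
        self.local = {}             # reports cached on this host
        self.shared = shared_cache  # stand-in for the cache on the server
        self.analyze = analyze      # callable that produces a threat report

    def report_for(self, uid, data):
        """Return (report, source), where source records which level answered."""
        if uid in self.local:
            return self.local[uid], "local"
        if uid in self.shared:
            self.local[uid] = self.shared[uid]
            return self.local[uid], "shared"
        report = self.analyze(data)
        self.shared[uid] = report
        self.local[uid] = report
        return report, "analyzed"

    def invalidate(self, uid):
        """Drop a cached report, for example after a signature update."""
        self.local.pop(uid, None)
        self.shared.pop(uid, None)


shared = {}
host_a = CachedLookup(shared, analyze=lambda data: "safe")
host_b = CachedLookup(shared, analyze=lambda data: "safe")
print(host_a.report_for("uid-1", b"file bytes"))  # first access anywhere: analyzed
print(host_b.report_for("uid-1", b"file bytes"))  # another host: shared cache
print(host_b.report_for("uid-1", b"file bytes"))  # same host again: local cache
```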


4.3 Archival and Forensics Service


The third and final component of the architecture is a ser- 
vice that provides information on file usage across partic- 
ipating hosts which can assist in post-infection forensic 
analysis. While some forensics tracking systems [14, 8] 
provide fine-grained details tracing back to the exact vul- 
nerable processes and system objects involved in an in- 
fection, they are often accompanied by high storage re- 
quirements and performance degradation. Instead, we 
opt for a lightweight solution consisting of file access in- 
formation sent by the host agent and stored securely by 
the network service, in addition to the behavioral pro- 
files of malicious software generated by the behavioral 
detection engines. Depending on the privacy policy of 
the organization, a tunable amount of forensics information
can be logged and sent to the archival service. For exam- 
ple, a more security conscious organization could spec- 
ify that information about every executable launch be 
recorded and sent to the archival service. Another pol- 
icy might specify that only accesses to unsafe files be 
archived without any personally identifiable information. 

Archiving forensic and file usage information provides 
a rich information source for both security professionals 
and administrators. From a security perspective, tracking 
the system events leading up to an infection can assist 
in determining its cause, assessing the risk involved with 
the compromise, and aiding in any necessary disinfection 
and cleanup. In addition, threat reports from behavioral 
engines provide a valuable source of forensic data as the
exact operations performed by a piece of malicious software
can be analyzed in detail. From a general administration
perspective, knowledge of which applications and
files are frequently in use can aid the placement of file
caches and application servers, and can even be used to determine
the optimal number of licenses needed for expensive
applications.

Consider the outbreak of a zero-day exploit. An en- 
terprise might receive a notice of a new malware attack 
and wonder how many of their systems were infected. 
In the past, this might require performing an inventory 
of all systems, determining which were running vulnera- 
ble software, and then manually inspecting each system. 
Using the forensics archival interface in the proposed ar- 
chitecture, an operator could search for the UID of the 
malicious file over the past few months and instantly find 
out where, when, and by whom the file was opened and what malicious
actions the file performed. The impacted machines
could then immediately be quarantined. 
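
Such a search reduces to filtering archived access records by UID and time. The sketch below assumes a hypothetical `AccessRecord` layout and an in-memory list as the archive; the host names, users, and dates are invented.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AccessRecord:
    """One archived file access (hypothetical layout)."""
    uid: str
    host: str
    user: str
    when: datetime


def who_opened(log, uid, since):
    """Return (host, user, time) for each access to `uid` at or after `since`, oldest first."""
    hits = [r for r in log if r.uid == uid and r.when >= since]
    return [(r.host, r.user, r.when) for r in sorted(hits, key=lambda r: r.when)]


log = [
    AccessRecord("uid-bad", "lab-pc-07", "alice", datetime(2007, 10, 2, 9, 15)),
    AccessRecord("uid-ok", "lab-pc-07", "alice", datetime(2007, 10, 2, 9, 20)),
    AccessRecord("uid-bad", "lab-pc-12", "bob", datetime(2007, 9, 28, 16, 40)),
    AccessRecord("uid-bad", "lab-pc-03", "carol", datetime(2007, 3, 1, 8, 0)),
]
for host, user, when in who_opened(log, "uid-bad", since=datetime(2007, 8, 1)):
    print(host, user, when.isoformat())
```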

The forensics archive also enables retrospective detec- 
tion. The complete archive of files that are transmitted to 
the network service may be re-scanned by available en- 
gines whenever a signature update occurs. Retrospective 
detection allows previously undetected malware that has 
infected a host to be identified and quarantined. 
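
Retrospective detection can be sketched as a re-scan of the archive that reports only files whose verdict changes from clean to malicious. The `retrospective_scan` function, the dictionary-based archive, and the `updated_scan` stand-in for an engine with new signatures are assumptions for illustration.

```python
def retrospective_scan(archive, verdicts, scan):
    """Re-scan archived files and return the UIDs that are newly flagged as malicious.

    `archive` maps UID -> file bytes, `verdicts` maps UID -> the previous result
    (True if flagged), and `scan` is the detector with updated signatures.
    `verdicts` is updated in place with the new results.
    """
    newly_detected = []
    for uid, data in archive.items():
        flagged = scan(data)
        if flagged and not verdicts.get(uid, False):
            newly_detected.append(uid)
        verdicts[uid] = flagged
    return newly_detected


archive = {"uid-1": b"benign program", "uid-2": b"program with payload-X"}
verdicts = {"uid-1": False, "uid-2": False}  # results before the signature update


def updated_scan(data):
    """Stand-in for an engine whose new signature matches payload-X."""
    return b"payload-X" in data


print(retrospective_scan(archive, verdicts, updated_scan))
```

The returned UIDs can then be looked up in the access records to find the hosts that opened the affected files.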


5 CloudAV Implementation 


To explore and validate the proposed in-cloud antivirus 
architecture, we constructed a production quality imple- 
mentation called CloudAV. In this section we describe 
how CloudAV implements each of the three main com- 
ponents of the architecture. 


5.1 Host Agent 


We implement the host agent for a variety of platforms 
including Windows 2000/XP/Vista, Linux 2.4/2.6, and 
FreeBSD 6.0+. The implementation of the host agent is 
designed to acquire executable files for analysis by the 
in-cloud network service, as executables are a common 
source of malicious content. We discuss how the agent 
can be extended to acquire DLLs, documents, and other 
common malcode-bearing file types in Section 7.

While the exact APIs are platform dependent (CreateProcess
on Win32, execve syscall on Linux 2.4, LSM
hooks on Linux 2.6, etc.), the host agent hooks and interposes
on system events. This interposition is implemented
via the MadCodeHook [16] package on the
Win32 platform and via the Dazuko [24] framework for 
the other platforms. Process creation events are inter- 
posed upon by the host agent to acquire and process can- 
didate executables before they are allowed to continue. 
In addition, filesystem events are captured to identify
new files entering a host and preemptively transfer them 
to the network service before execution to eliminate any 
user-perceived latencies. 

As motivating factors of our work include the com- 
plexity and security risks involved in running host-based 
antivirus, the host agent was designed to be simple and 
lightweight, both in code size and resource requirements. 
The Win32 agent is approximately 1500 lines of code of 
which 60% is managed code, further reducing the vul- 
nerability profile of the agent. The agent for the other 
platforms is written in Python and is under 300 lines of
code. 

While the host agent is primarily targeted at end hosts, 
our architecture is also effective in other deployment sce- 
narios such as mail servers. To demonstrate this, we also 
implemented a milter (mail filter) frontend for use with 
mail transfer agents (MTAs) such as Sendmail and Post- 
fix to scan all attachments on incoming emails. Using 
the pymilter API, the milter frontend weighs in at ap- 
proximately 100 lines of code. 


5.2 Network Service 


The network service acts as a dispatch manager between 
the host agent and the backend analysis engines. Incom- 
ing candidate files are received, analyzed, and a threat 
report is returned to the host agent dictating the appro- 
priate action to take. Communication between the host 
agent and the network service uses an HTTP wire protocol
protected by mutually authenticated SSL/TLS. Between 
the components within the network service itself, com- 
munication is performed via a publish/subscribe bus to 
allow modularization and effective scalability. 

The network service allows for various priorities to be 
assigned to analysis requests to aid latency-sensitive ap- 
plications and penalize misbehaving hosts. For example, 
application and mail scanning may take higher analysis 
priority than background analysis tasks such as retrospective
detection (described in Section 7). This also enables
the system to penalize or temporarily suspend misbehaving
hosts that may try to submit many analysis requests
or otherwise flood the system. 
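
The prioritization described above can be illustrated with a minimal sketch. The priority levels, the per-host quota, and the class name AnalysisQueue are assumptions made for illustration; they are not taken from the CloudAV implementation.

```python
import heapq
import itertools

# Hypothetical priority levels; lower numbers are served first.
PRIORITY_INTERACTIVE = 0   # application launches, mail scanning
PRIORITY_BACKGROUND = 1    # e.g. re-scanning archived files
PRIORITY_PENALIZED = 2     # requests from hosts that exceed their quota


class AnalysisQueue:
    """Priority queue of analysis requests with a simple per-host quota."""

    def __init__(self, max_pending_per_host=3):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order
        self._pending = {}               # host -> number of queued requests
        self._max_pending = max_pending_per_host

    def submit(self, host, file_id, priority):
        pending = self._pending.get(host, 0)
        if pending >= self._max_pending:
            priority = PRIORITY_PENALIZED  # demote hosts that flood the queue
        self._pending[host] = pending + 1
        heapq.heappush(self._heap, (priority, next(self._order), host, file_id))

    def next_request(self):
        priority, _, host, file_id = heapq.heappop(self._heap)
        self._pending[host] -= 1
        return host, file_id, priority
```

In this sketch a flooding host is demoted rather than rejected outright; temporary suspension would be an equally plausible policy.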

Each backend engine runs in a Xen virtualized con- 
tainer, which offers significant advantages in terms of 
isolation and scalability. Given the numerous vulnera- 
bilities in existing antivirus software discussed in Sec- 
tion 2, isolation of the antivirus engines from the rest of 
the system is vital. If one of the antivirus engines in the 
backend is targeted and successfully exploited by a mali- 
cious candidate file, the virtualized container can simply 
be disposed of and immediately reverted to a clean snapshot.
As for scalability, virtualized containers allow the
network service to spin up multiple instances of a particular
engine when demand for its services increases.

Figure 4: Screen captures of the detection engine VM monitoring interface (a) and the web management portal which
provides access to forensic data and threat reports (b).

Our current implementation employs 12 engines: 10 
traditional antivirus engines (Avast, AVG, BitDefender, 
ClamAV, F-Prot, F-Secure, Kaspersky, McAfee, Syman- 
tec, and Trend Micro) and 2 behavioral engines (Nor- 
man Sandbox and CWSandbox). The exact version of 
each detection engine is listed in Figure 1(a). Nine of the
backend engines run in a Windows XP environment using
Xen’s HVM capabilities while the other three run in a
Gentoo Linux environment using Xen domU paravirtualization.
Implementing each particular engine for the
backend is a simple task and extending the backend with
additional engines in the future is equally simple. For
reference, the amount of code required for each engine is
42 lines of Python code on average with a median of 26
lines of code.


5.3 Management Interface


The third component is a management interface which
provides access to the forensics archive, policy enforcement,
alerting, and report generation. These interfaces
are exposed to network administrators via a web-based
management interface. The web interface is implemented
using CherryPy, a Python web development
framework. A screen capture of the dashboard of the
management interface is depicted in Figure 4.

The centralized management and network-based architecture
allows administrators to enforce network-wide
policies and define alerts when those policies are
violated. Alerts are defined through a flexible specifica- 
tion language consisting of attributes describing an ac- 
cess request from the host agent and boolean predicates 
similar to an SQL WHERE clause. The specification 
language allows for notification for triggered alerts (via 
email, syslog, SNMP) and enforcement of administrator- 
defined policies. 
For example, network administrators may desire to
block certain applications from being used on end hosts. 
While these unwanted applications may not be explic- 
itly malicious, they may have a negative effect on host or 
network performance or be against acceptable use poli- 
cies. We observed several classes of these potentially 
unwanted applications in our production deployment in- 
cluding P2P applications (uTorrent, Limewire, etc) and 
multi-player gaming (World of Warcraft, online poker, 
etc). Other policies can be defined to reinforce prudent 
security practices, such as blocking the user from execut- 
ing attachments from an email application. 
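
To make the idea of attribute-based predicates concrete, the following is a minimal sketch of how such policies might be expressed and evaluated. The attribute names (filename, path, parent), the rule names, and the helper functions are hypothetical; the paper does not give the syntax of the actual specification language.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Request = Dict[str, str]  # attributes of one access request (names are assumed)


@dataclass
class Rule:
    name: str
    predicate: Callable[[Request], bool]  # plays the role of the WHERE clause
    action: str                           # "alert" or "block"


def attr_equals(key, value):
    return lambda req: req.get(key, "").lower() == value.lower()


def attr_contains(key, fragment):
    return lambda req: fragment.lower() in req.get(key, "").lower()


def any_of(*preds):
    return lambda req: any(p(req) for p in preds)


def all_of(*preds):
    return lambda req: all(p(req) for p in preds)


# Illustrative policies only: block P2P clients and executables launched
# from a mail client's temporary attachment directory.
RULES: List[Rule] = [
    Rule("no-p2p", any_of(attr_equals("filename", "utorrent.exe"),
                          attr_equals("filename", "limewire.exe")), "block"),
    Rule("no-mail-attachments",
         all_of(attr_equals("parent", "outlook.exe"),
                attr_contains("path", "\\temp\\")), "block"),
]


def evaluate(request: Request) -> List[str]:
    """Return 'action:rule-name' for every rule whose predicate matches."""
    return [f"{r.action}:{r.name}" for r in RULES if r.predicate(request)]
```

Composable predicates keep each policy declarative, which is the property the SQL WHERE analogy suggests.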


6 Evaluation 


In this section, we provide an evaluation of the proposed 
architecture through two distinct sources of data. The 
first source is a dataset of malicious software collected 
over a period of a year. Using this dataset, we evaluate 
the effectiveness of N-version protection and retrospec- 
tive detection. We also utilize this malware dataset to 
empirically quantify the size of the vulnerability window.


The second data source is derived from a production 
deployment of the system on a campus network in com- 
puter labs spanning multiple departments for a period of 
over 6 months. We use the data collected from this de- 
ployment to explore the performance characteristics of 
CloudAV. For example, we analyze the number of files 
handled by the network service, the utility of the caching 
system, and the time it takes the detection engines to ana- 
lyze individual files. In addition, we use deployment data 
to demonstrate the forensics capabilities of the approach. 
We detail two real-world case studies from the deploy- 
ment, one involving an infection by malicious software 
and one involving a suspicious, yet legitimate executable. 
Engines   3 Months   1 Month   1 Week
1         73.9%      63.1%     59.6%
2         87.7%      81.0%     77.6%
3         92.0%      87.8%     84.8%
4         93.8%      90.9%     88.4%
5         94.8%      92.4%     90.5%
6         95.4%      93.4%     91.8%
7         95.9%      94.0%     92.8%
8         96.2%      94.5%     93.5%
9         96.5%      94.8%     94.0%
10        96.7%      95.0%     94.4%

Figure 5: The average detection coverage for the various datasets (a) and the continuous coverage over time (b) when
a given number of engines are used in parallel.


6.1 Malware Dataset Results 


The first component of the evaluation is based on a malware
dataset obtained through Arbor Networks’ Arbor
Malware Library (AML) [20]. AML is composed of malware
collected using a variety of techniques such as
distributed darknet honeypots, spam traps, and honeyclient
spidering. The use of a diverse set of collection tech- 
niques means that the malware samples are more rep- 
resentative of threats faced by end hosts than malware 
datasets collected using only a single collection method- 
ology such as Nepenthes [3]. The AML dataset used in 
this paper consists of 7220 unique malware samples col- 
lected over a period of one year (November 12th, 2006 to 
November 11th, 2007). An average of 20 samples were 
collected each day with a standard deviation of 19.6 sam- 
ples. 


6.1.1 N-Version Protection 


We used the AML malware dataset to assess the effec- 
tiveness of a set of heterogeneous detection engines. Fig- 
ure 5(a) and (b) show the overall detection rate across dif- 
ferent time ranges of malware samples as the number of 
detection engines is increased. The detection rates were 
determined by looking at the average performance across 
all combinations of N engines for a given N. For exam- 
ple, the average detection rate across all combinations of 
two detection engines over the most recent 3 months of 
malware was 87.7%. 
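
The averaging procedure can be sketched as follows. The function name and the toy detection sets are illustrative assumptions; only the method (mean coverage over all combinations of N engines) comes from the text above.

```python
from itertools import combinations


def average_coverage(detections, total_samples, n):
    """Mean detection rate over every combination of n engines.

    detections maps an engine name to the set of sample ids it flags;
    a sample counts as detected if at least one engine in the
    combination flags it.
    """
    rates = []
    for engines in combinations(sorted(detections), n):
        detected = set().union(*(detections[e] for e in engines))
        rates.append(len(detected) / total_samples)
    return sum(rates) / len(rates)


# Toy data (hypothetical): three engines, six samples, sample 6 is never detected.
toy = {"A": {1, 2, 3}, "B": {3, 4}, "C": {5}}
for n in (1, 2, 3):
    print(n, round(average_coverage(toy, 6, n), 3))
```

With ten engines the number of combinations peaks at 252 for N = 5, so exhaustive enumeration is inexpensive.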

Figure 5(a) demonstrates how the use of multiple het- 
erogeneous engines allows CloudAV to significantly im- 
prove the aggregate detection rate. Figure 5(b) shows the 
detection rate over malware samples ranging from one 
day old to one year old. The graph shows how using 
ten engines can increase the detection rate for the entire
year-long AML dataset as high as 98%. 

The graph also reveals that CloudAV significantly improves
the detection rate of more recent malware. When
a single antivirus engine is used, the detection rate
degrades from 82% against a year-old dataset to 52%
against a day-old dataset (a decrease of 30 percentage
points). However, using ten antivirus engines, the detection
coverage only goes from 98% down to 88% (a decrease of
only 10 percentage points). These results show that not only
do multiple engines complement each other to provide a
higher detection rate, but the combination also resists
coverage degradation as the encountered threats become more
recent. As the most recent threats are typically the most
important, a detection rate of 88% versus 52% is a
significant advantage.

Another noticeable feature of Figure 5 is the decrease 
in incremental coverage. Moving from 1 to 2 engines
results in a large jump in detection rate, moving from 
2 to 3 is smaller, moving from 3 to 4 is even smaller, 
and so on. The diminishing marginal utility of additional 
engines shows that a practical balance may be reached 
between detection coverage and licensing costs, which 
we discuss further in Section 7. 

In addition to the averages presented in Figure 5, the 
minimum and maximum detection coverage for a given 
number of engines is of interest. For the one week time 
range, the maximum detection coverage when using only 
a single engine is 78.6% (Kaspersky) and the minimum 
is 39.7% (Avast). When using 3 engines in parallel, 
the maximum detection coverage is 93.6% (BitDefender, 
Kaspersky, and Trend Micro) and the minimum is 69.1% 
(ClamAV, F-Prot, and McAfee). However, the optimal 
combination of antivirus vendors to achieve the most 
comprehensive protection against malware may not be 
a simple measure of total detection coverage. Rather, a
number of complex factors may influence the best choice 
of detection engines, including the types of threats most 
commonly faced by the hosts being protected, the algo- 
rithms used for detection by a particular vendor, the ven- 
dor’s response time to 0-day malware, and the collection 
methodology and visibility employed by the vendor to 
collect new malware. 


6.2 Retrospective Detection 


We also used the AML malware dataset to understand the 
utility of retrospective detection. Recall that retrospec- 
tive detection is the ability to use historical information 
and archived files stored by CloudAV to retrospectively 
detect and identify hosts infected with malware that
has previously gone undetected. Retrospective detection 
is an especially important post-infection defense against 
0-day threats and is independent of the number or vendor 
of antivirus engines employed. Imagine a polymorphic 
threat not detected by any antivirus or behavioral engine 
that infects a few hosts on a network. In the host-based 
antivirus paradigm, those hosts could become infected, 
have their antivirus software disabled, and continue to be 
infected indefinitely. 


In the proposed system, the infected file would be 
sent to the network service for analysis, deemed clean, 
archived at the network service, and the host would be- 
come infected. Then, when any of the antivirus ven- 
dors update their signature databases to detect the threat, 
the previously undetected malware can be re-scanned 
in the network service’s archive and flagged as mali- 
cious. Instantly, armed with this new information, the 
network service can identify which hosts on the network 
have been infected in the past by this malware from its 
database of execution history and notify the administra- 
tors with detailed forensic information. 
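
The workflow in this paragraph can be sketched as a small in-memory archive. The class name ForensicsArchive, its fields, and the scan callback are assumptions for illustration; they stand in for the network service's archive, its execution history database, and an updated detection engine.

```python
from collections import defaultdict


class ForensicsArchive:
    """Toy archive of submitted files and the hosts that executed them."""

    def __init__(self):
        self.files = {}                      # file hash -> file contents
        self.executions = defaultdict(list)  # file hash -> [(host, timestamp)]
        self.flagged = set()                 # hashes already known to be malicious

    def record(self, file_hash, content, host, timestamp):
        self.files.setdefault(file_hash, content)
        self.executions[file_hash].append((host, timestamp))

    def rescan(self, scan):
        """Re-scan archived files with an updated scanner.

        Returns a mapping from each newly flagged file hash to the
        sorted list of hosts that executed it in the past.
        """
        affected = {}
        for file_hash, content in self.files.items():
            if file_hash not in self.flagged and scan(content):
                self.flagged.add(file_hash)
                hosts = {host for host, _ in self.executions[file_hash]}
                affected[file_hash] = sorted(hosts)
        return affected
```

Calling rescan after each signature update yields exactly the list of hosts an administrator would need to quarantine.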


Retrospective detection is especially important as fre- 
quent signature updates from vendors continually add 
coverage for previously undetected malware. Using our 
AML dataset and an archive of a year’s worth of McAfee 
DAT signature files (with a one week granularity), we de- 
termined that approximately 100 new malware samples 
were detected each week on average (with a standard de- 
viation of 57) by the McAfee updates. More importantly, 
for those samples that were eventually detected by a sig- 
nature update (5147 out of 7220), the average time from 
when a piece of malware was observed to when it was 
detected (i.e. the vulnerability window) was approxi- 
mately 48 days. A cumulative distribution function of 
the days between observation and detection is depicted 
in Figure 6. 
Figure 6: Cumulative distribution function depicting the
number of days between when a malware sample is ob- 
served and when it is first detected by the McAfee an- 
tivirus engine. 
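
The vulnerability window statistics can be computed from two timestamps per sample. The sketch below uses hypothetical sample names and dates; it shows only the calculation, not the data behind Figure 6.

```python
from datetime import date


def vulnerability_windows(observed, detected):
    """Days from first observation to first detection, for detected samples only."""
    return sorted((detected[s] - observed[s]).days
                  for s in detected if s in observed)


def empirical_cdf(windows, days):
    """Fraction of detected samples whose window is at most `days`."""
    return sum(1 for w in windows if w <= days) / len(windows)


# Hypothetical data: four samples observed, three eventually detected.
observed = {s: date(2007, 1, 1) for s in ("a", "b", "c", "d")}
detected = {"a": date(2007, 1, 8), "b": date(2007, 1, 31), "c": date(2007, 3, 2)}

windows = vulnerability_windows(observed, detected)
mean_window = sum(windows) / len(windows)
```

Samples that are never detected (sample "d" here) are excluded, mirroring the restriction in the text to samples eventually covered by a signature update.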


6.3 Deployment Results 


With the aid of network operations and security staff we 
deployed CloudAV across a large campus network. In 
this section, we discuss results based on the data col- 
lected as a part of this deployment. 


6.3.1 Executable Events 


One of the core variables that impacts the resource re- 
quirements of the network service is the rate at which 
new files must be analyzed. If this rate is extremely high, 
extensive computing resources will be required to handle 
the analysis load. Figure 7 shows the number of total ex- 
ecution events and unique executables observed during a 
one month period in a university computing lab. 

Figure 7 shows that while the total number of executa- 
bles run by all the systems in the lab is quite large (an 
average of 20,500 per day), the number of unique exe- 
cutables run per day is two orders of magnitude smaller 
(an average of 217 per day). Moreover, the number of 
unique executables is likely inflated due to the fact that 
these machines are frequently used by students to work 
on computer science class projects, resulting in a large 
number of distinct executables with each compile of a 
project. A more static, non-development environment 
would likely see even fewer unique executables.

We also investigated the origins of these executables 
based on the file path of 1000 unique executables stored 
in the forensics archive. Table 1 shows the breakdown
of these sources. The majority of executables originate 
from the local hard drive but a significant portion were 
launched from various network sources. Executables
from the temp directory often indicate that they were
downloaded via a web browser and executed, contributing
even more to networked origins. In addition, a non-trivial
number of executables were introduced to the system
directly from external media such as a CDROM
drive and USB flash media. This diversity exemplifies
the need for a host agent that is capable of acquiring files
from a variety of sources.

Figure 7: Executable launches (a) and unique executable launches (b) per day over a one month period in a representative
sample of 50 machines in the deployment.

Local Drives (52.4%)     Program Files         22.3%
                         Temp Directory        14.2%
                         Windows Directory     13.4%
                         Other                  2.4%

Network Drives (43.3%)   Engineering Apps      23.6%
                         User Desktop Shares    9.3%
                         User AFS Shares        8.3%
                         Other                  2.1%

External Media (4.4%)    USB Flash              2.4%
                         CDROM Drive            2.0%

Table 1: A distribution of the sources of 1000 executables
observed during the deployment of our host agent
over a six-month period.


6.3.2 Caching and Performance 


A second important variable that determines the scalabil- 
ity and performance of the system is the cache hit rate. A 
hit in the local cache can prevent network requests, and 
a hit in the remote cache can prevent unnecessary file
transfers to the network service. The hosts instrumented
as a part of the deployment were heavily loaded Windows
XP workstations. The Windows Start Menu contained
over 250 executable applications including a wide
range of internet, multimedia, and engineering packages. 


Our results indicate that 10 processes were launched
between the time the host agent service loads and the time
the login screen appears, and another 52 processes were
launched before the user’s desktop loaded. As a measure of
overhead, we measured the number of bytes transferred
between a specific client and the network service under
different caching conditions. With a warm remote cache,
the boot-up process transferred 8.7 KB and the login process
transferred 46.2 KB. In the case of a cold remote cache, which
would only ever occur a single time when the first host in
the network loaded for the first time, the boot-up process
transferred 406 KB and the login process transferred 12.5 MB.
For comparison, the Active Directory service installed on the
deployment machines transferred 171 KB and 270 KB on boot
and login respectively.

It is also possible to evaluate the performance of the 
caching system by looking at Figure 7. We recorded
over 615,000 total execution events over one month
yet only observed 1300 unique executables. As a remote 
cache miss only happens when a new executable is ob- 
served, the remote cache hit rate is approximately 99.8%. 
Even more significant, the local cache can be pre-seeded 
with known installed software during the host agent in- 
stallation process, improving the hit rate further. In the 
infrequent case when a miss occurs in both the local and 
remote cache, the candidate file must be transferred to 
the network service. Network latency, throughput, and 
analysis time all affect the user-perceived delay between 
when a file is acquired by the host agent and a threat 
report is returned by the network service. As local networks
usually have low latencies and high bandwidth,
the analysis time of files will often dominate the network 
latency and throughput delay. The average time for a 
detection engine to analyze a candidate file in the AML 
dataset was approximately 1.3 seconds with a standard 
deviation of 1.8 seconds. 
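
The two-level lookup and the hit-rate estimate can be sketched as follows. The use of SHA-1 as the file identifier, the HostAgent class, and the dictionary standing in for the network service are assumptions for illustration.

```python
import hashlib


def file_id(content: bytes) -> str:
    """Content hash used as the cache key (SHA-1 here is an assumption)."""
    return hashlib.sha1(content).hexdigest()


class HostAgent:
    def __init__(self, remote_cache, analyze):
        self.local_cache = {}             # file id -> verdict, kept on the host
        self.remote_cache = remote_cache  # shared dict standing in for the service
        self.analyze = analyze            # stand-in for the backend engines
        self.stats = {"local_hit": 0, "remote_hit": 0, "transfer": 0}

    def check(self, content: bytes) -> str:
        key = file_id(content)
        if key in self.local_cache:            # no network traffic at all
            self.stats["local_hit"] += 1
        elif key in self.remote_cache:         # only the identifier is sent
            self.stats["remote_hit"] += 1
            self.local_cache[key] = self.remote_cache[key]
        else:                                  # full file transfer and analysis
            self.stats["transfer"] += 1
            self.remote_cache[key] = self.analyze(content)
            self.local_cache[key] = self.remote_cache[key]
        return self.local_cache[key]


def remote_hit_rate(total_events: int, unique_files: int) -> float:
    """Each unique file misses the remote cache exactly once."""
    return 1 - unique_files / total_events


# Figures from the deployment: 1 - 1300/615000 is roughly 0.998.
rate = remote_hit_rate(615000, 1300)
```

The calculation treats every unique executable as exactly one remote miss, which is why the hit rate depends only on the ratio of unique files to total events.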


6.3.3 Forensics Case Studies 


We review two case studies from the deployment 
concerning two real-world events that demonstrate the 
utility of the forensics archive. 


Malware Case Study: While running the host agent in 
transparent mode in the campus deployment, the Clou- 
dAV system alerted us to a candidate executable that had 
been marked as malicious by multiple antivirus engines. 
It is important to note that this malicious file success- 
fully evaded the local antivirus software (McAfee) that 
was installed alongside our host agent. Immediately, we
accessed the management interface to view the forensics
information associated with the tracked execution event
and runtime behavioral results provided by the two
behavioral engines employed in our network service.

The initial executable launched by the user was 
warcraft3keygen.exe, an apparent serial number 
generator for the game Warcraft 3. This executable was 
just a bootstrap for the m222.exe executable which was 
written to the Windows temp directory and subsequently 
launched via CreateProcess. m222.exe then copied 
itself to C:\Program Files\Intel\Intel, made itself 
hidden and read-only, and created a fraudulent Windows 
service via the Service Control Manager (SCM) called 
Remote Procedure Call (RPC) MO to launch itself 
automatically at system startup. Additionally, the 
malware attempted to contact command and control 
infrastructure through DNS requests for several names 
including 50216.ipread.com, but the domains had 
already been blackholed. 


Legitimate Case Study: In another instance, we were 
alerted to a candidate executable that was flagged as sus- 
picious by several engines. The executable in question 
was the PsExec utility from SysInternals which allows 
for remote control and command execution. Given that 
this utility can be used for both malicious and legitimate 
purposes, it was worthy of further investigation to deter- 
mine its origin. 

Using the management interface, we were able to im- 
mediately drill down to the affected host, user, files, and 
environment of the suspected event. The PsExec service 
psexesvc.exe was first launched from the parent pro- 
cess services.exe when an incoming remote execution 
request arrived from the PsExec client. The next execution
event was net.exe with the command line argument
localgroup administrators, which results in the listing
of all the users in the local administrators group.


Three factors led us to dismiss the event as legitimate. 
First, the operation performed by the net command was 
not overtly malicious. Second, the user performing this 
action was a known network administrator. Lastly, we 
were able to determine the net.exe executable was iden- 
tical to the one deployed across all the hosts in the net- 
work, ruling out the case where the net.exe program it- 
self may have been a trojaned version. While this event 
could be seen as a false positive, it is actually an important
alert that needs to be dealt with by a network administrator.
The forensic and historical information provided
through the management interface allows these events to 
be dealt with remotely in an accurate and efficient man- 
ner. 


7 Discussion and Limitations 


Moving detection functionality into the network cloud 
has other technical and practical implications. In this 
section we attempt to highlight limitations of the pro- 
posed model and then describe a few resulting benefits. 


7.1 User Context and Environment in Detection Engines


One important benefit of running detection engines on 
end systems is that local context such as user input, net- 
work input, operating system state, and the local filesys- 
tem are available to aid detection algorithms. For ex- 
ample, many antivirus vendors use behavioral detection 
routines that monitor running processes to identify mis- 
behaving or potentially malicious programs. 


While it is difficult to replicate the entire state of end 
systems inside the network cloud, there are two general 
techniques an in-cloud antivirus system can use to pro- 
vide additional context to detection engines. First, de- 
tection engines can open or execute files inside a VM 
instance. For example, existing antivirus behavioral
detection systems can be leveraged by opening and running
files inside a virtual antivirus detection instance. A
second technique is to replicate more of the local end
system state in the cloud. For example, when a file is sent
to the network service, contextual metadata such as other 
running processes can be attached to the submission and 
used to aid detection. However, because complete local 
state can be quite large, there are many instances where 
deploying local detection agents may be required to
complement in-cloud detection.
7.2 Disconnected Operation


Another challenge with moving detection into the net- 
work is that network connectivity is needed to analyze 
files. An end host participating in the service may enter 
a disconnected state for many reasons including network 
outages, mobility constraints, misconfiguration, or denial 
of service attacks. In such a disconnected state, the host 
agent may not be able to reach the network service to 
check the remote cache or to submit new files for analy- 
sis. Therefore, in certain scenarios, the end host may be 
unable to complete its desired operations. 

Addressing the issue of disconnected operation is pri- 
marily an issue of policy, although the architecture in- 
cludes technical components that aid in continued pro- 
tection in a disconnected state. For example, the local 
caching employed by our host agent effectively allows 
a disconnected user to access files that have previously 
been analyzed by the network service. However, for files 
that have not yet been analyzed, a policy decision is nec- 
essary. Security-conscious organizations may select a 
strict policy requiring that users have network connec- 
tivity before accessing new applications, while organiza- 
tions with less strict security policies may desire more 
flexibility. As our host agent works together with host- 
based antivirus, local antivirus software installed on the 
end host may provide adequate protection for these en- 
vironments with more liberal security policies until net- 
work access is restored. 


7.3 Sources of Malicious Behavior 


Malicious code or inputs that cause unwanted program 
behavior can be present in many places such as in the 
linking, loading, or running of the initial program in- 
structions, and the reading of input from memory, the 
filesystem, or the network. For example, some types of 
malware use external files such as DLLs loaded at run- 
time to store and later execute malicious code. In addi- 
tion, recent vulnerabilities in desktop software such as 
Adobe Acrobat [1] and Microsoft Word [32] have exem- 
plified the threat from documents, multimedia, and other 
non-executable malcode-bearing file types. Developing 
a host agent that handles all these different sources of 
malicious behavior is challenging. 

The CloudAV implementation described in this paper 
focuses on executables, but the host agent can be ex- 
tended to identify other file types. To explore the chal- 
lenges of extending the system we modified the host 
agent to monitor the DLL dependencies for each executable
acquired by the host agent. Each dependent DLL
of an application is processed similarly to the executable
itself: the local and remote caches are checked to determine
if it has been previously analyzed, and if not, it is
transmitted to the network service for analysis. Extending the
host agent further to handle documents would be as simple
as instructing the host agent to listen for filesystem
events for the desired file types. In fact, the types of files
acquired by the host agent could be dynamically configured
at a central location by an administrator to adapt to
evolving threats.

AV Vendor     3 Months   1 Month   1 Week
Avast         +14.8%     +16.6%    +24.6%
AVG           +5.9%      +6.8%     +8.7%
BitDefender   +4.0%      +5.3%     +3.1%
ClamAV        +0.0%      +0.0%     +0.0%
F-Prot        +9.9%      +15.3%    +12.6%
F-Secure      +7.9%      +9.3%     +15.0%
Kaspersky     +1.5%      +1.9%     +2.3%
McAfee        +10.6%     +14.0%    +14.2%
Symantec      +17.8%     +23.0%    +20.6%
Trend Micro   +9.8%      +11.5%    +12.6%

Table 2: The percentage increase in detection coverage
obtained when ClamAV, a truly free engine, is added to a
deployment with only a single engine.


7.4 Detection Engine Licensing 


Most of the antivirus and behavioral engines employed 
in our architecture required paid licenses. Acquiring li- 
censes for all the engines may be infeasible for some or- 
ganizations. While we have chosen a large number of 
engines for evaluation and measurement purposes, the 
full amount may not be necessary to obtain effective pro- 
tection. As seen in Figure 5, ten engines may not be 
the most effective price/performance point as diminish- 
ing returns are observed as more engines are added. 

We currently employ four free engines in our sys- 
tem for which paid licenses were not necessary: AVG, 
Avast, BitDefender, and ClamAV. Using only these four 
engines, we are still able to obtain 94.3%, 92.0%, and 
88.0% detection coverage over periods of 3 months, 1 
month, and 1 week respectively. These detection coverage
values for the combined free engines exceed those of every
single vendor in each dataset period.

While the interpretation of the various antivirus li- 
censes is unclear in our architecture, especially with re- 
gards to virtualization, it is likely that site-wide licenses 
would be needed for the “free” engines for a commercial 
deployment. Even if only one licensed engine is used, 
our system still maintains the benefits such as forensics 
and management. As an experiment for this scenario, we 
measured how much detection coverage would be gained 
by adding the only truly free (GPL licensed) antivirus 
product, ClamAV, to an existing system employing only
a single engine. Although ClamAV is not an especially 
effective engine by itself, it can add a significant amount 
of detection coverage, up to a 25% increase when paired 
with another engine as seen in Table 2. 


7.5 Managing False Positives 


The use of parallel detection engines has important im- 
plications for the management of false positives. While 
multiple detection engines can increase detection cov- 
erage, the number of false positives encountered during 
normal operation may increase when compared to a sin- 
gle engine. While antivirus vendors try hard to reduce 
false positives, they can severely impair productivity and 
take weeks to be corrected by a vendor. 

The proposed architecture provides the ability to aggregate
results from different detection techniques, which
enables the unique ability to trade off detection coverage
for false positive resistance. If an administrator wanted
maximal detection coverage, they could set the aggregation
function to declare a candidate file unsafe if any detector
indicated the file was malicious. However, a false positive
in any of the detectors would cause the aggregator to
declare the file unsafe.

In contrast, an administrator more concerned about 
false positives may set the aggregation function to de- 
clare a candidate file unsafe if at least half of the detec- 
tors deemed the file malicious. In this way multiple de- 
tection engines can be used to reduce the impact of false 
positives associated with any single engine. 

To explore this trade-off, we collected 12 real-world 
false positives that impact different detectors in Clou- 
dAV. These files range from printer drivers to password 
recovery utilities to self-extracting zip files. We defined 
a threshold, or confidence index, of the number of en- 
gines required to detect a file before deeming it unsafe. 
For each threshold value, we measured the number of 
remaining false positives and also the corresponding de- 
tection rate of true positives. 

The results of this experiment are shown in Table 3. At a
threshold of 4 engines, all of the false positives are
eliminated while the overall detection coverage decreases by
less than 4%. As this threshold can be adjusted at any time via
the management interface, it can be set by an administrator
based on the perceived threat model of the network and the
actual number of false positives encountered during operation.
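
The threshold-based aggregation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the CloudAV implementation: the function name `is_unsafe`, the engine names, and the verdict dictionary are all hypothetical.

```python
from typing import Mapping


def is_unsafe(verdicts: Mapping[str, bool], threshold: int) -> bool:
    """Return True when at least `threshold` engines flag the file.

    `verdicts` maps a (hypothetical) engine name to its verdict;
    True means that engine reported the file as malicious.
    """
    if threshold < 1:
        raise ValueError("threshold must be at least 1")
    detections = sum(1 for flagged in verdicts.values() if flagged)
    return detections >= threshold


# Hypothetical verdicts for one candidate file from four engines.
verdicts = {
    "engine_a": True,
    "engine_b": False,
    "engine_c": True,
    "engine_d": False,
}

# Maximal coverage: a single detection is enough.
max_coverage = is_unsafe(verdicts, threshold=1)
# False positive resistance: at least half of the engines must agree.
majority = is_unsafe(verdicts, threshold=(len(verdicts) + 1) // 2)
# Strictest setting for four engines: all of them must agree.
strict = is_unsafe(verdicts, threshold=4)
```

With the sample verdicts, the first two settings declare the file unsafe and the strictest one does not, which mirrors the coverage versus false positive trade-off reported in Table 3.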

A second method of handling false positives is enabled 
by the centralized management of the network service. In 
the case of a standard host-based antivirus deployment, 
encountering a false positive may mean weeks of delay 
and loss of productivity while the antivirus vendor ana- 
lyzes the false positive and releases an updated signature 



















Threshold   False Positives   Detection
1           12                97.7%
2           5                 96.3%
3           2                 95.2%
4           0                 93.9%




















Table 3: The number of false positives observed at each 
engine threshold and the associated detection coverage 
over the full malware dataset. 


set to all affected clients. In the network-based architec- 
ture, the false positive can be added to a network-wide 
whitelist through the management interface in a matter of 
minutes by a local administrator. This whitelist manage- 
ment allows administrators to alleviate the pain of false 
positives and empowers them to cut out the antivirus ven- 
dor middle-man and make more informed and rapid de- 
cisions about threats on their network. 


7.6 Breaking Free of Vendor Lock-in 


Finally, a serious issue associated with extensive deploy- 
ments of host-based antivirus in a large enterprise or or- 
ganizational network is vendor lock-in. Once a partic- 
ular vendor has been selected through an organization’s 
evaluation process and software is deployed to all depart- 
ments, it is often hard to switch to a new vendor at a later 
point due to technical, management, and bureaucratic is- 
sues. In reality, organizations may wish to switch an- 
tivirus vendors for a number of reasons such as increased 
detection coverage, decreased licensing costs, or integra- 
tion with network management devices. 

The proposed antivirus architecture is innately vendor- 
neutral as it separates the acquisition of candidate files 
on the end host from the actual analysis and detection 
process performed in the network service. Therefore, 
even if only one detection engine is employed in the net- 
work service, a network administrator can easily replace 
it with another vendor’s offering if so desired, without an 
upheaval of existing infrastructure. 


8 Conclusion 


To address the ever-growing sophistication and threat of 
modern malicious software, we have proposed a new 
model for antivirus deployment by providing antivirus 
functionality as a network service using N-version pro- 
tection. This novel paradigm provides significant ad- 
vantages over traditional host-based antivirus including 
better detection of malicious software, enhanced foren- 
sics capabilities, retrospective detection, and improved 
deployability and management. Using a production im- 
plementation and real-world deployment of the CloudAV 
platform, we evaluated the effectiveness of the proposed
architecture and demonstrated how it provides signifi- 
cantly greater protection of end hosts against modern 
threats. 

In the future, we plan to investigate the application 
of N-version protection to intrusion detection, phishing, 
and other realms of security that may benefit from het- 
erogeneity. We also plan to open our backend analysis 
infrastructure to security researchers to aid in the detec- 
tion and classification of collected malware samples. 
Abstract

The notion of blacklisting communication sources has 
been a well-established defensive measure since the ori- 
gins of the Internet community. In particular, the prac- 
tice of compiling and sharing lists of the worst offenders 
of unwanted traffic is a blacklisting strategy that has re- 
mained virtually unquestioned over many years. But do 
the individuals who incorporate such blacklists into their 
perimeter defenses benefit from the blacklisting contents 
as much as they could from other list-generation strate- 
gies? In this paper, we will argue that there exist better 
alternative blacklist generation strategies that can pro- 
duce higher-quality results for an individual network. 
In particular, we introduce a blacklisting system based 
on a relevance ranking scheme borrowed from the link- 
analysis community. The system produces customized 
blacklists for individuals who choose to contribute data 
to a centralized log-sharing infrastructure. The ranking 
scheme measures how closely related an attack source is 
to a contributor, using that attacker’s history and the con- 
tributor’s recent log production patterns. The blacklisting 
system also integrates substantive log prefiltering and a 
severity metric that captures the degree to which an at- 
tacker’s alert patterns match those of common malware- 
propagation behavior. Our intent is to yield individual- 
ized blacklists that not only produce significantly higher 
hit rates, but that also incorporate source addresses that 
pose the greatest potential threat. We tested our scheme 
on a corpus of over 700 million log entries produced 
from the DShield data center and the result shows that 
our blacklists not only enhance hit counts but also can 
proactively incorporate attacker addresses in a timely 
fashion. An early form of our system has been fielded
to DShield contributors over the last year. 


1 Introduction 


A network address blacklist represents a collection of 
source IP addresses that have been deemed undesirable, 
where typically these addresses have been involved in
some previous illicit activities. For example, DShield (a 
large-scale security-log sharing system) regularly com- 
piles and posts a firewall-parsable blacklist of the most 
prolific attack sources seen by its contributors [17]. With 
more than 1700 contributing sources providing a daily 
stream of 30 million security log entries, such daily 
blacklists provide an informative view of those class C 
subnets that are among the bane of the Internet with re- 
spect to unwanted traffic. We refer to the blacklists that 
are formulated by a large-scale alert repository and con- 
sist of the most prolific sources in the repository’s col- 
lection of data as the global worst offender list (GWOL). 
Another strategy for formulating network address black- 
lists is for an individual network to create a local blacklist 
based entirely on its own history of incoming communi- 
cations. Such lists are often culled from a network’s pri- 
vate firewall log or local IDS alert store, and incorporate 
the most repetitive addresses that appear within the logs. 
We call this blacklist scheme the local worst offender list 
(LWOL) method. 


The GWOL and LWOL strategies have both strengths 
and inherent weaknesses. For example, while GWOLs 
provide networks with important information about 
highly prolific attack sources, they also have the poten- 
tial to exhaust the subscribers’ firewall filter sets with ad- 
dresses that will simply never be encountered. Among 
the sources that do target the subscriber, GWOLs may 
miss a significant number of attacks, in particular when 
the attack sources prefer to choose their targets more 
strategically, focusing on a few known vulnerable net- 
works [4]. Such attackers are not necessarily very pro- 
lific and are hence elusive to GWOLs. The sources on an 
LWOL have repetitively sent unwanted communications 
to the local network and are likely to continue doing so. 
However, LWOLs are limited by being entirely reactive — 
they only capture attackers that have been pounding the 
local network and hence cannot provide a potential for 
the blacklist consumer to learn of attack sources before 
these sources reach their networks.

Furthermore, both types of lists suffer from the fact 
that an attack source does not achieve candidacy until it 
has produced a sufficient mass of communications. That 
is, although it is desirable for firewall filters to include 
an attacker’s address before it has saturated the network, 
neither GWOL nor LWOL offers a solution that can provide
such timely filters. This is a problem particularly
with GWOL. Even after an attacker has produced signif- 
icant illicit traffic, it may not show up as a prolific source 
within the security log repository, because the data con- 
tributors of the repository are a very small set of networks 
on the Internet. Even repositories such as DShield that 
receive nearly 1 billion log entries per month represent
only a small sampling of Internet activity. Significant at- 
tacker sources may elude incorporation into a blacklist 
until they have achieved extensive saturation across the 
Internet. 

In summary, a high-quality blacklist that fortifies network
firewalls should achieve a high hit rate, should incorporate
addresses in a timely fashion, and should proactively
include addresses even when they have not been
encountered previously by the blacklist consumer’s net- 
work. Toward this goal, we present a new blacklist gen- 
eration system which we refer to as the highly predictive 
blacklisting (HPB) system. The system incorporates 1) 
an automated log prefiltering phase to remove unreliable 
alert contents, 2) a novel relevance-based attack source 
ranking phase in which attack sources are prioritized on 
a per-contributor basis, and 3) a severity analysis phase 
in which attacker priorities are adjusted to favor attack- 
ers whose alerts mirror known malware propagation pat- 
terns. The system constructs final individualized black- 
lists for each DShield contributor by a weighted fusion 
of the relevance and severity scores. 

HPB’s underlying relevance-based ranking scheme 
represents a significant departure from the long-standing 
LWOL and GWOL strategies. Specifically, the HPB 
scheme examines not just how many targets a source ad- 
dress has attacked, but also which targets it has attacked. 
In the relevance-based ranking phase, each source ad- 
dress is ranked according to how closely related the 
source is to the target blacklist subscriber. This relevance 
measure is based on the attack source similarity patterns 
that are computed across all members of the DShield 
contributor pool (i.e., the amount of attacker overlap ob- 
served between the contributors). Using a data correla- 
tion strategy similar to hyper-text link analysis, such as 
Google’s PageRank [2], the relationships among all the 
contributors are iteratively explored to compute an indi- 
vidual relevance value from each attacker to each con- 
tributor. 

We evaluated our HPB system using more than 720 
million log entries produced by DShield contributors 
from October to November 2007. We contrast the performance
of the system with that of the corresponding
GWOLs and LWOLs, using identical time windows, in- 
put data, and blacklist lengths. Our results show that for 
most contributors (more than 80%), our blacklist entries 
exhibit significantly higher hit counts over a multiday 
testing window than both GWOL and LWOL. Further 
experiments show that our scheme can proactively incor- 
porate attacker addresses into the blacklist before these 
addresses reach the blacklist consumer network, and it 
can do so in a timely fashion. Finally, our experiments 
demonstrate that the hit count increase is consistent over 
time, and the advantages of our blacklist remain stable 
across various list lengths and testing windows. 

The contribution of this paper is the introduction of the 
highly predictive blacklisting system, which includes our 
methodology for prefiltering, relevance-based ranking, 
attacker severity ranking, and final blacklist construction.
Ours is the first exploration of a link-analysis-based scheme
in the context of security filter production, and the first to
quantify the predictive quality of the resulting data. The HPB
system is also one of the few new approaches to large-scale
blacklist publication that we are aware of having been
proposed in many years. However, our HPB system
is applicable only to those users who participate as
active contributors to collaborative security log data cen- 
ters. Rather than a detriment, we hope that this fact pro- 
vides some operators a tangible incentive to participate 
in security log contributor pools. Finally, the system dis- 
cussed in this paper, while still a research prototype, has 
been fully implemented and deployed for nearly a year 
as a free service on the Internet at DShield.org. Our ex- 
perience to date leads us to believe that this approach is 
both scalable and feasible for daily use. 

The rest of the paper is organized as follows. Section 2 
provides a background on previous work in blacklist gen- 
eration and related topics. In Section 3 we provide a de- 
tailed description of the Highly Predictive Blacklist sys- 
tem. In Section 4 we present a performance evaluation 
of HPBs, GWOLs, and LWOLs, including assessments
of the extent to which the above three desired blacklist 
properties (hit rate, proactive appearance, and timely in- 
clusion) are realized by these three blacklists. In Sec- 
tion 5 we present a prototype implementation of the HPB 
system that is freely available to DShield.org log contrib- 
utors, and we summarize our key findings in Section 6. 


2 Related Work 


Network address and email blacklists have been around 
since the early development of the Internet [6]. To- 
day, sites such as DShield regularly compile and pub- 
lish firewall-parsable filters of the most prolific attack 
sources reported to its website [17]. DShield represents 
a centralized approach to blacklist formulation, providing
a daily perspective of the malicious background radiation
that plagues the Internet [15, 20]. Other recent
examples of computer and network blacklists include IP 
and DNS blacklists to help networks detect and block 
unwanted web content, SPAM producers, and phishing 
sites, to name a few [7, 8, 17, 18]. The HPB system pre- 
sented here complements, but does not displace these re- 
sources or their blacklisting strategies. In addition, HPBs 
are only applicable to active log contributors (we hope 
as an incentive), not as generically publishable one-size- 
fits-all resources. 

More agile forms of network blacklisting have also 
been explored, with the intention of rapidly publishing 
perimeter filters to control actively spreading malware 
epidemics [1,3, 12, 14]. For example, in [14] a peer- 
to-peer blacklisting scheme is proposed, where each net- 
work incorporates an address into its local blacklist when 
a threshold number of peers have reported attacks from 
this address. We separate our HPB system from these 
malware defense schemes. While the HPB system does 
incorporate a malware-oriented attacker severity metric 
into its final blacklist selection, we neither contemplate
nor propose HPBs for use in the context of dynamic
quarantine defenses for malware epidemics.

One key insight that inspired the HPB relevance-based 
ranking scheme was raised by Katti et al. [10], who iden- 
tified the existence of stable correlations among the at- 
tackers reported by security log contributors. Here we in- 
troduce a relevance-based recommendation scheme that 
selects candidate attack sources based on the attacker 
overlaps found among peer contributors. This relevance-based
ranking scheme can be viewed as a random walk on the
correlation graph, going from one node to another along the
edges of the graph with probability proportional to the
edge weights. This form of random
walk has been applied in link-analysis systems such as 
Google’s PageRank [2], where it is used to estimate the 
probability that a webpage may be visited. Similar link 
analysis has been used to rank movies [13] and reading 
lists [19]. 

The problem of predicting attackers has also been
recently considered in [24] using a Gaussian process
model. However, [24] focuses purely on developing
statistical learning techniques for attacker prediction based
on collaborative filtering. In this paper, we present a
comprehensive blacklist generation system that considers
many other characteristics of attackers. The prediction
part is only one component in our system. Furthermore,
the prediction model presented here is completely different
from the one in [24] (a Gaussian process model in [24] and
a link analysis model here). By accepting some penalty in
predictive power, the prediction model presented here is
much more scalable, which is a necessity for implementing
a deployable service (Section 5).

Finally, [23] provides a six-page summary of the earli- 
est release of our DShield HPB service, including a high- 
level description of an early ranking scheme. In this pa- 
per we have substantially expanded this algorithm and 
present its full description for the first time. This present 
paper also introduces the integration of metrics to capture 
attack source maliciousness in its final rank selection, 
and presents the full blacklist construction system. We 
also present our quantitative evaluation of multiple sys- 
tem properties, and address several open questions that 
have been raised over the past year since our initial pro- 
totype. 


3 Blacklisting System 


We illustrate our blacklisting system in Figure 1. The
system constructs blacklists in three stages. First, the
security alerts supplied by sensors across the Internet are
preprocessed to remove known noise in the alert collection.
We call this the prefiltering stage. The preprocessed data
are then fed into two parallel engines. One ranks, for each
contributor, the attack sources according to their relevance
to that contributor. The other scores the sources using a
severity assessment that measures their maliciousness. The
relevance ranking and the severity score are combined at the
last stage to generate a final blacklist for each contributor.

We describe the prefiltering process in Section 3.1,
relevance ranking in Section 3.2, severity scoring in
Section 3.3, and the final production of the blacklists in
Section 3.4.


3.1 Prefiltering Logs for Noise Reduction 


One challenge to producing high-quality threat intelli- 
gence for use in perimeter filtering is that of reducing 
the amount of noise and erroneous data that may exist in 
the input data that drives our blacklist construction algo- 
rithm. That is, in addition to the unwanted port scans, 
sweeps, and intrusion attempts reported daily within the 
DShield log data, there are also commonly produced 
log entries that arise from nonhostile activity, or activ- 
ity from which useful filters cannot be reliably derived. 
While it is not possible to separate attack from nonat- 
tack data, the HPB system prefilters from consideration 
logs that match criteria that we have been able to empiri- 
cally identify as commonly occurring nonuseful input for 
blacklist construction purposes. 

As a preliminary step prior to blacklist construction, 
we apply three filtering techniques to the DShield alert 
logs. First, the HPB system removes from consideration 
DShield logs produced from attack sources from invalid 
or unassigned IP address space. Here we employ the 
Figure 1: Blacklisting system architecture


bogon list created by the Cymru team that captures addresses
that are reserved, not yet allocated, or delegated by the
Internet Assigned Numbers Authority [16]. Typically, such
addresses should not be routed, yet they do appear in the
DShield data. In addition, reserved addresses such as
10.x.x.x or 192.168.x.x may also appear in misconfigured
contributor logs; these entries are not useful for
translating into blacklists.

Second, the system prefilters from consideration net- 
work addresses from Internet measurement services, web 
crawlers, or common software update sources. From ex- 
perience, we have developed a whitelist of highly com- 
mon sources that, while innocuous from an intrusion per- 
spective, often generate alarms in DShield contributor 
logs. 

Finally, the HPB system applies heuristics to avoid 
common false positives that arise from commonly timed- 
out network services. Specifically, we exclude logs pro- 
duced from source ports TCP 53 (DNS), 25 (SMTP), 80 
(HTTP), and 443 (often used for secure web, IMAP, and 
VPN), and from destination ports TCP 53 (DNS) and 
25 (SMTP). Firewalls will commonly time out sessions 
from these services when the server or client becomes 
unresponsive or is slow. In practice, the combination of 
these prefiltering steps provides approximately a 10%
reduction in the DShield input stream prior to delivery to
the blacklist generation system.
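
The three prefiltering rules above can be sketched as a single predicate over log entries. This is a minimal illustration, not the HPB implementation: the function name `keep_entry`, the tiny `BOGON_NETS` stand-in for the full bogon list, and the whitelisted address are all assumptions made for the example. Only the port numbers come from the text.

```python
import ipaddress

# TCP source ports excluded by the timed-out-service heuristic.
NOISY_SRC_PORTS = {53, 25, 80, 443}
# TCP destination ports excluded by the same heuristic.
NOISY_DST_PORTS = {53, 25}

# Stand-in for the bogon list: only a few well-known reserved ranges.
BOGON_NETS = [
    ipaddress.ip_network(net)
    for net in ("10.0.0.0/8", "192.168.0.0/16", "127.0.0.0/8")
]

# Hypothetical whitelist of benign but noisy sources (a documentation address).
WHITELIST = {ipaddress.ip_address("192.0.2.10")}


def keep_entry(src_ip: str, src_port: int, dst_port: int, proto: str = "TCP") -> bool:
    """Return True if a log entry survives all three prefilters."""
    src = ipaddress.ip_address(src_ip)
    # Rule 1: drop sources in invalid or unassigned address space.
    if any(src in net for net in BOGON_NETS):
        return False
    # Rule 2: drop whitelisted measurement, crawler, and update sources.
    if src in WHITELIST:
        return False
    # Rule 3: drop entries that look like timed-out sessions of common services.
    if proto.upper() == "TCP" and (
        src_port in NOISY_SRC_PORTS or dst_port in NOISY_DST_PORTS
    ):
        return False
    return True


# Hypothetical log entries: (source IP, source port, destination port).
entries = [
    ("10.1.2.3", 40000, 445),      # reserved source address
    ("198.51.100.7", 80, 51000),   # reply from a web server that timed out
    ("198.51.100.7", 40000, 445),  # kept
]
kept = [entry for entry in entries if keep_entry(*entry)]
```

Only the last sample entry passes all three rules. A deployment would load the complete bogon list and an operator-maintained whitelist instead of the placeholders used here.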


3.2 Relevance Ranking 


Our notion of attacker relevance is a measure that indicates
how closely the attacker is related to a particular blacklist
consumer. It also reflects the likelihood that the attacker
will reach the blacklist consumer in the near future. Note
that this relevance is orthogonal to metrics that measure
the severity (or benignness) of the source, which we will
discuss in the next section.

In our context, the blacklist consumers are the contrib- 
utors that supply security logs to a log-sharing repository 
such as DShield. Recent research has observed the exis- 
tence of attacker overlap correlations between DShield 
contributors [10], i.e., there are pairs of contributors that
share quite a few common attackers, where the common 
attacker is defined as a source address that both contrib- 
utors have logged and reported to the repository. This re- 
search also found that this attacker overlap phenomenon 
is not due to attacks that select targets randomly (as in a 
random scan case). The correlations are long lived and 
some of them are independent of address proximity. We 
exploit these overlap relationships to measure attacker 
relevance. 

We first illustrate a simple concept of attacker rele- 
vance. Consider a collection of security logs displayed 
in a tabular form as shown in Table 1. We use the rows 
of the table to represent attack sources and the columns 
to represent contributors. We refer to the unique source 
addresses that are reported within the log repository as 
attackers, and use the terms “attacker” and “source” in- 
terchangeably. Since the contributors are also the tar- 
gets of the logged attacks, we refer to them as victims. 
We will use the terms “contributor” and “victim”
interchangeably. An asterisk “*” in the table cell indicates
that the corresponding source has reportedly attacked the 
corresponding contributor. 
































      v1   v2   v3   v4   v5
s1    *    *
s2    *    *
s3    *         *
s4         *    *
s5         *
s6                   *    *
s7              *
s8              *    *























Table 1: Sample Attack Table 


Let us assume that Table 1 represents a series of logs
contributed in the recent past by our five victims, $v_1$
through $v_5$. Now suppose we would like to calculate
the relevance of the sources for contributor $v_1$ based on
these attack patterns. From the attack table we observe
that contributors $v_1$ and $v_2$ share multiple common
attackers. $v_1$ also shares one common attack source ($s_3$)
with $v_3$, but does not share attacker overlap with the other
contributors. Given this observation, between sources
$s_5$ and $s_6$, we would say that $s_5$ has more relevance to
$v_1$ than $s_6$ because $s_5$ has reportedly attacked $v_2$, which
has recently experienced multiple attack source overlaps
with $v_1$. But the victims of $s_6$’s attacks share no overlap
with $v_1$. Note that this relevance measure is quite different
from the measures based on how prolific the attack
source has been. The latter would favor $s_6$ over $s_5$, as $s_6$
has attacked more victims than $s_5$. In this sense, which
contributors a source has attacked is of greater significance
to our scheme than how many victims it has attacked.
Similarly, between $s_5$ and $s_7$, $s_5$ is more relevant,
because the victim of $s_5$ ($v_2$) shares more common
attacks with $v_1$ than the victim of $s_7$ ($v_3$). Finally,
because $s_4$ has attacked both $v_2$ and $v_3$, we would like to
say that it is the most relevant among $s_4$, $s_5$, $s_6$, and $s_7$.


To formalize the above intuition, we model the attack
correlation relationship between contributors using a
correlation graph, which is a weighted undirected
graph $G = (V, E)$. The nodes in the graph consist of the
contributors $V = \{v_1, v_2, \ldots\}$. There is an edge between
nodes $v_i$ and $v_j$ if $v_i$ is correlated with $v_j$. The weight on
the edge is determined by the strength of the correlation
(i.e., occurrences of attacker overlap) between the two
corresponding contributors. We now introduce some
notation for the relevance model.

Let $n$ be the number of nodes (number of contributors)
in the correlation graph. We use $\mathbf{W}$ to denote the
adjacency matrix of the correlation graph, where the entry
$W_{(i,j)}$ in this matrix is the weight of the edge between
nodes $v_i$ and $v_j$. For a source $s$, we denote by $T(s)$ the set
of contributors that have reported an attack from $s$. $T(s)$
can be written in a vector form $\mathbf{b}^s = \{b^s_1, b^s_2, \ldots, b^s_n\}$
such that $b^s_i = 1$ if $v_i \in T(s)$ and $b^s_i = 0$ otherwise.
We also associate with each source $s$ a relevance vector
$\mathbf{r}^s = \{r^s_1, r^s_2, \ldots, r^s_n\}$ such that $r^s_v$ is the relevance value
of attacker $s$ with respect to contributor $v$. We use lowercase
boldface to indicate vectors and uppercase boldface
to indicate matrices. Table 2 summarizes our notation.

We now describe how to derive the matrix $\mathbf{W}$ from
the attack reports. Consider the following two cases. In
Case 1, contributor $v_i$ sees attacks from 500 sources and
$v_j$ sees 10 sources. Five of these sources attack both $v_i$
and $v_j$. In Case 2, there are also five common sources.
However, $v_i$ sees only 50 sources and $v_j$ sees 10.
Although the number of overlapping sources is the same
(i.e., 5 common sources), the strength of connection
between $v_i$ and $v_j$ is different in the two cases. If a
contributor observes a lot of attacks, it is expected that there
should be more overlap between this contributor and the
others. Let $m_i$ be the number of sources seen by $v_i$, $m_j$
Symbol          Meaning
$n$             Number of contributors
$v_i$           $i$-th contributor
$\mathbf{W}$    Adjacency matrix of the correlation graph
$T(s)$          Set of contributors that have reported attack(s)
                from source $s$
$\mathbf{b}^s$  Attack vector for source $s$. $b^s_i = 1$ if
                $v_i \in T(s)$ and 0 otherwise
$\mathbf{r}^s$  Relevance vector for source $s$. $r^s_v$ is the
                relevance value of attacker $s$ with respect to
                contributor $v$














Table 2: Summary of Relevance Model Notations 


the number seen by $v_j$, and $m_{ij}$ the number of common
attack sources. The ratio $m_{ij}/m_i$ shows how important $v_i$ is
for $v_j$ while $m_{ij}/m_j$ shows how important $v_j$ is for $v_i$. Since
we want $W_{(i,j)}$ to reflect the strength of the connection
between $v_i$ and $v_j$, we set

$$W_{(i,j)} = \frac{m_{ij}}{m_i} \cdot \frac{m_{ij}}{m_j}.$$

One may view this new $\mathbf{W}$ as a standardized correlation matrix.
Figure 2 shows the matrix $\mathbf{W}$ for Table 1 constructed
using this method.


$$\mathbf{W} = \begin{pmatrix}
0     & 0.33  & 0.083 & 0    & 0   \\
0.33  & 0     & 0.063 & 0    & 0   \\
0.083 & 0.063 & 0     & 0.13 & 0   \\
0     & 0     & 0.13  & 0    & 0.5 \\
0     & 0     & 0     & 0.5  & 0
\end{pmatrix}$$


Figure 2: Standardized Correlation Matrix for Attack Ta- 
ble 1 
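
As a check on the construction, the following sketch computes the standardized correlation matrix from an attack table. The `ATTACKS` dictionary is an assumption: it encodes the Table 1 entries as they can be inferred from the surrounding text and from the values in Figure 2. The function and variable names are likewise hypothetical.

```python
# Attack table: source -> set of contributors (victims) that reported it.
# Assumed encoding of Table 1, chosen to agree with the text and Figure 2.
ATTACKS = {
    "s1": {1, 2},
    "s2": {1, 2},
    "s3": {1, 3},
    "s4": {2, 3},
    "s5": {2},
    "s6": {4, 5},
    "s7": {3},
    "s8": {3, 4},
}
N = 5  # number of contributors v1..v5


def correlation_matrix(attacks, n):
    """Build W with W[i][j] = (m_ij / m_i) * (m_ij / m_j) and a zero diagonal."""
    # sources_seen[i] holds the sources reported by contributor i + 1.
    sources_seen = [set() for _ in range(n)]
    for source, victims in attacks.items():
        for v in victims:
            sources_seen[v - 1].add(source)

    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j or not sources_seen[i] or not sources_seen[j]:
                continue
            m_ij = len(sources_seen[i] & sources_seen[j])  # common sources
            w[i][j] = (m_ij / len(sources_seen[i])) * (m_ij / len(sources_seen[j]))
    return w


W = correlation_matrix(ATTACKS, N)
for row in W:
    print("  ".join(f"{x:.4f}" for x in row))
```

Under this encoding the nonzero entries are 1/3, 1/12, 1/16, 1/8, and 1/2, which agree with the rounded values 0.33, 0.083, 0.063, 0.13, and 0.5 in Figure 2.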


Given this correlation matrix, we follow the aforementioned
intuition and calculate the relevance as
$r^s_i = \sum_{j \in T(s)} W_{(i,j)}$. This is to say that if the repository
reports that source $s$ has attacked contributor $v_j$, this fact
contributes a value of $W_{(i,j)}$ to the source’s relevance
with respect to the victim $v_i$. Written in vector form, it
gives us


$$\mathbf{r}^s = \mathbf{W} \cdot \mathbf{b}^s. \qquad (1)$$
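
To make Equation 1 concrete, the sketch below evaluates it for contributor v1 using the rounded matrix of Figure 2. The victim sets for s4 through s7 are the ones stated in the discussion of Table 1. The helper name `relevance` and the dictionary layout are hypothetical.

```python
# Standardized correlation matrix, copied from Figure 2 (rounded values).
W = [
    [0.0,   0.33,  0.083, 0.0,  0.0],
    [0.33,  0.0,   0.063, 0.0,  0.0],
    [0.083, 0.063, 0.0,   0.13, 0.0],
    [0.0,   0.0,   0.13,  0.0,  0.5],
    [0.0,   0.0,   0.0,   0.5,  0.0],
]


def relevance(w, victims):
    """Equation 1: r = W . b, where b marks the contributors attacked by s."""
    n = len(w)
    b = [1 if (j + 1) in victims else 0 for j in range(n)]
    return [sum(w[i][j] * b[j] for j in range(n)) for i in range(n)]


# Victims of each source, as described in the text for Table 1.
targets = {"s4": {2, 3}, "s5": {2}, "s6": {4, 5}, "s7": {3}}

# Relevance of each source with respect to contributor v1 (index 0).
rel_to_v1 = {s: relevance(W, victims)[0] for s, victims in targets.items()}
ranking = sorted(rel_to_v1, key=rel_to_v1.get, reverse=True)
```

The resulting order is s4, s5, s7, s6, which matches the intuition given earlier: s4 is the most relevant to v1, and s6 has zero relevance even though it attacked more victims than s5.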


The above simple relevance calculation lacks certain 
desired properties. For example, the simple relevance 
value is calculated solely from the observed activities 
from the source by the repository contributors. In some 
cases, this observation does not represent the complete 
view of the source’s activity. One reason is that the con- 
tributors consist of only a very small set of networks in 
the Internet. Before an attacker saturates the Internet 
with malicious activity, it is often the case that only a 
few contributors have observed the attacker. The attacker
may be at an early stage, or it may have attacked many
places, most of which do not participate in the security log
sharing system. Therefore, one may want a relevance measure
that has a “look-ahead” capability. That is, the relevance
calculation should take into consideration possible
future observations of the source and include these
anticipated observations from the contributors in the
relevance values.





Figure 3: Relevance Evaluation Considers Possible Fu- 
ture Attacks 


Figure 3 gives an example where one may apply this
“look-ahead” feature. (Examples here are independent
of the one shown in Table 1.) The correlation graph of
Figure 3 consists of four contributors numbered 1, 2, 3,
and 4. Contributor 2 reported an attack from source s
(represented by the star). Our goal is to evaluate how
relevant this attacker is to contributor 1 (double-circled
node). Using Equation 1, the relevance would be zero.
However, we observe that s has relevance 0.5 with respect
to contributor 3 and relevance 0.3 with respect to
contributor 4. Although contributors 3 and 4 have not yet
observed s, there may be future attacks from s. In
anticipation of this, when evaluating s’s relevance with
respect to contributor 1, contributors 3 and 4 pass to
contributor 1 their relevance values after multiplying them
by the weights on their respective edges. The attacker’s
relevance value for contributor 1 is then
0.5 × 0.2 + 0.3 × 0.2 = 0.16. Note that, had s actually
attacked contributors 3 and 4, the contributors would
have passed the relevance value 1 (again after multiplying
it by the weights on the edges) to contributor 1.
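
The look-ahead arithmetic above can be reproduced by applying the one-hop calculation twice. The edge list below is an assumption: it contains only the four edges implied by the numbers quoted in the text (0.5 and 0.3 from contributor 2, and 0.2 into contributor 1), and any other edges that Figure 3 may contain are omitted. The helper names are hypothetical.

```python
# Edge weights assumed for the Figure 3 graph (contributors 1..4).
EDGES = {(1, 3): 0.2, (1, 4): 0.2, (2, 3): 0.5, (2, 4): 0.3}
N = 4


def adjacency(edges, n):
    """Symmetric adjacency matrix of the undirected correlation graph."""
    w = [[0.0] * n for _ in range(n)]
    for (a, b), weight in edges.items():
        w[a - 1][b - 1] = weight
        w[b - 1][a - 1] = weight
    return w


def one_hop(w, values):
    """Propagate relevance values to immediate neighbors (one product W . x)."""
    n = len(w)
    return [sum(w[i][j] * values[j] for j in range(n)) for i in range(n)]


W = adjacency(EDGES, N)
b = [0, 1, 0, 0]                # only contributor 2 reported source s
direct = one_hop(W, b)          # Equation 1: contributor 1 gets zero
lookahead = one_hop(W, direct)  # values passed on by contributors 3 and 4
```

Here `direct[0]` is zero, while `lookahead[0]` is approximately 0.16, the value derived in the text for contributor 1.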

This can be viewed as a relevance propagation process.
If a contributor $v_i$ observed an attacker, we say that the
attacker has an initial relevance value 1 for that contributor.
Following the edges that go out of the contributor, a
fraction of this relevance can be distributed to the neighbors
of the contributor in the graph. Each of $v_i$’s neighbors
receives a share of relevance that is proportional to
the weight on the edge that connects the neighbor to $v_i$.
Suppose $v_j$ is one of the neighbors. A fraction of the
relevance received by $v_j$ is then further distributed, in similar
fashion, to its neighbors. The propagation of relevance
continues until the relevance values for each contributor
reach a stable state.


This relevance propagation process has another benefit 
besides the “look-ahead” feature. Consider the correla- 
tion graph given in Figure 4 (a). The subgraph formed 
by nodes 1, 2, 3, and 4 is very different from that formed 
by nodes 1, 5, 6, and 7. The subgraph from nodes 1, 2, 
3, and 4 is well connected (in fact it forms a clique). The 
contributors in the subgraph are thus more tied together. 
We call them a correlated group. (We use a dotted cir- 
cle to indicate the correlated group in Figure 4.) There 
may be certain intrinsic similarities between the mem- 
bers in the correlated group (e.g., IP address proximity, 
similar vulnerability). Therefore, it is natural to assign 
more relevance to source addresses that have attacked 
other contributors in the same correlated group. For ex- 
ample, consider the sources s and s’ in Figure 4. They 
both attacked three contributors. All the edges in the cor- 
relation graph have the same weights. (Hence, we omit- 
ted the weights in the figure.) We would like to say that s 
is more relevant than s’ for contributor 1. If we calculate 
the relevance value by Equation 1, the values would be 
the same for the two attackers. Relevance propagation 
helps to give more value to the attacker s because mem- 
bers of the correlated group are well connected. There 
are more paths in the subgraph that lead from the con- 
tributors where the attack happened to the contributor for 
which we are evaluating the attacker relevance. For ex- 
ample, the relevance from contributor 2 can propagate to 
contributor 3 and then to contributor 1. It can also go to 
contributor 4 and then to contributor 1. This is effectively 
the same as having an edge with larger weight between 
the contributors 2 and 1. Therefore, relevance propaga- 
tion can effectively discover and adapt to the structures 
in the correlation graph. The relevance values assigned 
then reflect certain intrinsic relationships among contrib- 
utors. 


Figure 4: Attacks on Members in a Correlated Group Contribute More Relevance

We extend Equation 1 to employ relevance propagation.
If we propagate the relevance values to the immediate
neighbors in the correlation graph, we obtain a relevance
vector $W \cdot b^s$ that represents the propagated values.
Now we propagate the relevance values one more
hop. This gives us $W \cdot W \cdot b^s = W^2 \cdot b^s$. The relevance
vector that reflects the total relevance value each
contributor receives is then $W \cdot b^s + W^2 \cdot b^s$. If we
let the propagation process iterate indefinitely, the relevance
vector would become $\sum_{i=1}^{\infty} W^i \cdot b^s$. There is a
technical detail in this process we need to resolve. Naturally,
we would like the relevance value to decay along
the path of propagation. The further it goes on the graph,
the smaller its contribution becomes. To achieve this,
we scale the matrix W by a constant $0 < \alpha < 1$ such
that the 2-norm of the new matrix $\alpha W$ becomes smaller
than one. With this modification, an attacker will have
only a negligible relevance value to contributors that are
far away in the correlation graph. Putting the above together,
we compute the relevance vector by the following
equation:

$$r^s = \sum_{i=1}^{\infty} (\alpha W)^i \cdot b^s \qquad (2)$$

We observe that $b^s + r^s$ is the solution for x in the
following system of linear equations:

$$x = b^s + \alpha W \cdot x \qquad (3)$$


The linear system described by Equation 3 is exactly the 
system used by Google’s PageRank [2]. PageRank ana- 
lyzes the link structures of webpages to determine the 
relevance of each webpage with respect to a keyword 
query. In PageRank, $b^s$ is set to be an all-one vector and
W is determined by letting $W_{(i,j)}$ be 1/(# of outgoing
links on page j) if one of these outgoing links points
to webpage i, and $W_{(i,j)} = 0$ otherwise. Therefore,
PageRank propagates relevance where every node pro- 
vides an initial relevance value of one. In our relevance 
calculation, only nodes whose corresponding contribu- 
tors have reported the attacker are assigned one unit of 
initial relevance. Similar to the PageRank values that re- 
flect the link structures of the webpages, our relevance 
values reflect the structure of the correlation graph that 
captures intrinsic relationships among the contributors. 

Equation 3 can be solved to give $x = (I - \alpha W)^{-1} \cdot b^s$,
where I is the identity matrix. Also, since $x = r^s + b^s$,
$r^s = (I - \alpha W)^{-1} \cdot b^s - b^s = [(I - \alpha W)^{-1} - I] \cdot b^s$.
This gives the relevance vector for each attack source. 
The sources are then ranked, for each contributor, ac- 
cording to the relevance values. As each attack source 
has a potentially different relevance value for each con- 
tributor, the rank of a source with respect to different con- 
tributors is different. Note that our concept of relevance 
measure and relevance propagation does not depend on a 
particular choice of the W matrix. As long as W reflects 
the connection weight between the contributors, our rel- 
evance measure applies. 
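
As a rough illustration of the propagation described above, the following Python sketch solves the linear system x = b + αW·x by fixed-point iteration and returns r = x − b. It is not the authors' implementation: the four-contributor matrix, the value α = 0.5, and the function names are assumptions chosen for the example, and a production system would more likely use a sparse linear solver.

```python
def mat_vec(W, x):
    """Multiply a square matrix (given as a list of rows) by a vector."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def propagate_relevance(W, b, alpha, tol=1e-12, max_iter=10_000):
    """Solve x = b + alpha * W x by fixed-point iteration; return r = x - b.

    The iteration converges when alpha * W has norm below one, which is the
    condition the text imposes when choosing alpha.
    """
    x = list(b)
    for _ in range(max_iter):
        Wx = mat_vec(W, x)
        x_next = [bi + alpha * wxi for bi, wxi in zip(b, Wx)]
        converged = max(abs(p - q) for p, q in zip(x_next, x)) < tol
        x = x_next
        if converged:
            break
    return [xi - bi for xi, bi in zip(x, b)]

# Hypothetical correlation graph: four contributors forming a clique with
# equal edge weights. W[i][j] is the weight of the edge from j to i.
W = [
    [0.0, 0.2, 0.2, 0.2],
    [0.2, 0.0, 0.2, 0.2],
    [0.2, 0.2, 0.0, 0.2],
    [0.2, 0.2, 0.2, 0.0],
]
b = [0.0, 1.0, 0.0, 0.0]  # only the contributor at index 1 reported the source
alpha = 0.5               # here alpha * W has 2-norm 0.3, so the series converges

r = propagate_relevance(W, b, alpha)
print([round(v, 4) for v in r])
```

The contributors that did not report the source still receive a nonzero relevance value, which is the "look-ahead" effect; ranking the sources for a contributor then amounts to sorting them by that contributor's entry in each source's vector.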
3.3 Analyzing Attack Pattern Severity


We now consider the problem of measuring the degree 
to which each attack source exhibits known patterns of 
malicious behavior. In the next section, we will discuss
how this measure can be fused into our final blacklist 
construction decisions. In this section we will describe 
our model of malicious behavior and the attributes we 
extract to map each attacker’s log production patterns to 
this model. 

Our model of malicious behavior, in this instance, fo- 
cuses on identifying typical scan-and-infect malicious 
software (or malware). We define our malware behav- 
ior pattern as that of an attacker who conducts an IP 
sweep to small sets of ports that are known to be as- 
sociated with malware propagation or backdoor access. 
This behavior pattern matches the malware behavior pattern
documented by Yegneswaran et al. [20], as
well as our own most recent experiences (within the last
twelve months) of more than 20K live malware infections
observed within our honeynet [21]. Other
malware behavior patterns may also be applied, such as
the scan-oriented malicious address detection
schemes outlined in the context of dynamic signature
generation [11] and malicious port scan analysis [9].
Regardless of the malware behavior model used, the de- 
sign and integration of other severity metrics into the fi- 
nal blacklist generation process can be carried out in a 
similar fashion. 

For the set of log entries over the relevance-calculation 
time window, we calculate several attributes for each at- 
tacker’s /24 network address. (Our blacklists are speci- 
fied on a per /24 basis, meaning that a single malicious 
address has the potential to induce a LAN-wide filter. 
This is standard practice for DShield and other black- 
lists.) For each attacker, we assign a score to target ports 
associated with the attacker, assigning a different weight 
depending on whether or not the port is associated with 
known malware communications. 

Let MP be the set of malware-associated ports, where
we currently use the definition in Figure 5. This MP
is derived from various AV lists and our honeynet experiences.
We do not argue that this list is complete,
and it can be expanded across the life of our HPB service.
However, our experiences in live malware analysis indicate
that the entries in MP are both highly common and
highly indicative of malware propagation.

53-UDP       69-UDP       137-UDP
135-TCP      139-TCP      445-TCP
2082-TCP     2100-TCP     2283-TCP
3127-TCP     3128-TCP     3306-TCP
6101-TCP     6129-TCP     8866-TCP
12345-TCP    11768-TCP    15118-TCP
4444-TCP     9995-TCP     9996-TCP
1434-UDP

21-TCP       53-TCP       42-TCP
559-TCP      1025-TCP     1433-TCP
2535-TCP     2745-TCP     2535-TCP
3410-TCP     5000-TCP     5554-TCP
9898-TCP     10000-TCP    10080-TCP
17300-TCP    27374-TCP    65506-TCP
17300-TCP    3140-TCP     9033-TCP

Figure 5: Malware Associated Ports

Let the number of malware-associated target ports (ports in MP)
that attacker s connects to be $c_m$, and the total number of
unique ports connected to be defined as $c_u$. We associate a
weighting (or importance) factor $w_m$ for all ports in MP, and a weighting
factor $w_u$ for all nonmalware ports. We then compute
a malware port score (PS) metric for each attacker as
follows:

$$PS(s) = \frac{(w_u \times c_u) + (w_m \times c_m)}{c_u} \qquad (4)$$

Here, we intend $w_m$ to be of greater weight than $w_u$,
and choose an initial default of $w_m = 4 * w_u$. PS has the
property that even if a large $c_m$ is found, if $c_u$ is also large
(as in a horizontal portscan), then PS will remain small.
Again, our intention is to promote a malware behavior
pattern in which malware propagation will tend to target
fewer specific ports, and is not associated with attackers
that engage in horizontal port sweeps.
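
The behavior of the port score can be illustrated with a small Python sketch. It assumes PS = (w_u · c_u + w_m · c_m) / c_u as in Equation 4, with c_m counted as the attacker's unique target ports that fall in MP. The value w_u = 1 and the four-port subset of MP are assumptions made for the example; only the ratio w_m = 4 · w_u comes from the text.

```python
# Hypothetical subset of the malware-associated ports (MP) listed in Figure 5.
MALWARE_PORTS = {("TCP", 135), ("TCP", 139), ("TCP", 445), ("UDP", 1434)}

def port_score(target_ports, malware_ports=MALWARE_PORTS, w_u=1.0, w_m=None):
    """Compute PS = (w_u * c_u + w_m * c_m) / c_u for one attacker.

    target_ports: iterable of (protocol, port) pairs the attacker connected to.
    c_u is the number of unique ports; c_m is how many of them are in MP.
    """
    if w_m is None:
        w_m = 4.0 * w_u  # default from the text: w_m = 4 * w_u
    unique_ports = set(target_ports)
    c_u = len(unique_ports)
    if c_u == 0:
        raise ValueError("attacker has no target ports")
    c_m = len(unique_ports & set(malware_ports))
    return (w_u * c_u + w_m * c_m) / c_u

# An attacker focused on two malware-associated ports scores high ...
focused = [("TCP", 445), ("TCP", 139)]
# ... while a horizontal sweep of TCP ports 1-1000 scores close to w_u.
sweep = [("TCP", p) for p in range(1, 1001)]
print(port_score(focused), port_score(sweep))
```

The sweep touches three ports in the assumed MP subset, yet its score stays near 1 because c_u is large, which is the property the text describes.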

Next, we compute the set of unique target IP addresses 
connected to by attacker s. We refer to this count as 
TC(s). A large TC represents confirmed IP sweep be- 
havior, which we strongly associate with our malware 
behavior model. TC is the exclusive prioritization met- 
ric used by GWOL, whereas here we consider TC a sec- 
ondary factor to PS in computing a final malware be- 
havior score. We could also include metrics regarding 
the number of DShield sensors (i.e., unique contributor 
IDs) that have reported the attacker, which arguably rep- 
resents the degree of consensus in the contributor pool 
that the attack source is active across the Internet. How- 
ever, the IP sweep pattern is of high interest, even when 
the IP sweep experiences may have been reported only 
by a smaller set of sensors. 

Third, we compute an optional tertiary behavior metric
that captures the ratio of national to international addresses
that are targeted by attacker s, IR(s). Within
the DShield repository we find many cases of sources
(such as from China, Russia, and the Czech Republic) that
exclusively target international victims. However, this
may also illustrate a weakness in the DShield contributor
pool, as there may be very few contributors that operate
sensors within these countries. We incorporate a dampening
factor δ (0 < δ < 1) that allows the consumer
to express the degree to which the IR factor should be
nullified in computing the final severity score for each
attacker.

Finally, we compute a malware severity score MS(s)
for each candidate attacker that may appear in the set of
final blacklist entries:

$$MS(s) = PS(s) + \log(TC(s)) + \delta \log(IR(s)) \qquad (5)$$

The three factors are computed in order of significance
in mapping to our malware behavior model. The logarithm
is used because in our model, the secondary metric (TC)
and the tertiary metric (IR) are less important than the
malware port score and we only care about their order of
magnitude.
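
A minimal Python sketch of the severity score follows. The text does not state the base of the logarithm or how a zero IR (a source that targets only international victims) is handled, so the base-10 logarithm, the default δ = 0.5, and the explicit rejection of non-positive inputs are all assumptions of this example.

```python
import math

def malware_severity(ps, tc, ir, delta=0.5):
    """MS(s) = PS(s) + log(TC(s)) + delta * log(IR(s)), as in Equation 5.

    ps: malware port score, tc: number of unique target IP addresses,
    ir: ratio of national to international targets, delta: dampening factor.
    Base-10 logarithms are an assumption; the text only says that the
    order of magnitude of TC and IR matters.
    """
    if tc <= 0 or ir <= 0:
        raise ValueError("TC and IR must be positive to take the logarithm")
    return ps + math.log10(tc) + delta * math.log10(ir)

# A source with PS = 5 that swept 1000 addresses, evenly split (IR = 1).
print(malware_severity(5.0, 1000, 1.0))
```

With these choices, a tenfold increase in TC adds one point to the score, so the port score remains the dominant term.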


3.4 Blacklist Production 


For each attacker, we now have both its relevance ranking 
and its severity score. We can combine them to generate 
a final blacklist for each contributor. 

For the final blacklist, we would like to include the 
attackers that have strong relevance and discard the non- 
relevant attackers. To generate a final list of length L, we 
use the attackers’ relevance ranking to compile a candidate
list of size c · L. (We often set c = 2.) Then, we use the
severity scores of the attackers on the candidate list to adjust
their ranking and pick the L highest-ranked attackers
to form the final list. Intuitively, the adjustment should
promote the rank of an attacker if the severity assessment 
indicates that it is very malicious. Toward this goal, we 
define a final score that combines the attacker’s relevance 
rank in the candidate list and its severity assessment. In 
particular, let k be the relevance rank of the attacker s
(i.e., s is the k-th entry in the candidate list). Recall from
the last section that MS(s) is the severity score of s. The final
score fin(s) is defined to be

$$fin(s) = k - \frac{L}{2} \cdot \Phi(MS(s)) \qquad (6)$$

where

$$\Phi(x) = \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{x - \mu}{d}\right)\right)$$

and erf(·) is the “S”-shaped Gaussian error function.
We plot Φ(x) in Figure 6 with μ = 4 and different values of d.

Figure 6: Phi with different d value


Φ(MS(s)) promotes the rank of an attacker according
to its maliciousness. The larger the value of Φ(MS(s))
is, the further the attacker is moved up compared to its
original rank. A Φ(MS(s)) of value 1 would move
the attacker up by one half of the size of the final
list compared to its original rank. The “S”-shaped Φ(·)
transforms the severity assessment MS(s) into a value
between 0 and 1. The less-malicious attackers often receive
an assessment score below 3. After transformation, they
will receive only small promotions. On the other hand,
malicious attackers that receive an assessment score above
7 will be highly promoted.

To generate the final list, we sort the fin(s) values of 
the attackers in the candidate list and then pick L of them 
that have the smallest fin(s). 
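
The selection procedure can be sketched in Python as follows. The sketch reads Equation 6 as fin(s) = k − (L/2) · Φ(MS(s)) with Φ(x) = ½(1 + erf((x − μ)/d)), which matches the description of Φ as an S-shaped map into (0, 1) and of a maximum promotion of half the list length. The value d = 1.5, the function names, and the toy data are assumptions for illustration.

```python
import math

def phi(x, mu=4.0, d=1.5):
    """S-shaped map of a severity score into the interval (0, 1)."""
    return 0.5 * (1.0 + math.erf((x - mu) / d))

def final_blacklist(ranked_sources, severity, L, c=2, mu=4.0, d=1.5):
    """Pick L sources from the top c*L by relevance, re-ranked by severity.

    ranked_sources: sources sorted by decreasing relevance for one contributor.
    severity: dict mapping each source to its severity score MS(s).
    """
    candidates = ranked_sources[: c * L]
    scored = []
    for k, s in enumerate(candidates, start=1):  # k is the relevance rank
        fin = k - (L / 2.0) * phi(severity[s], mu, d)
        scored.append((fin, k, s))
    scored.sort()  # smallest fin(s) first
    return [s for _, _, s in scored[:L]]

# Toy example: eight candidates for a list of length 4. Only s5 is severe.
sources = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
severity = {s: 1.0 for s in sources}
severity["s5"] = 9.0
print(final_blacklist(sources, severity, L=4))
```

In this toy run the severe source s5 is promoted by almost L/2 = 2 positions and displaces s4, while the low-severity sources keep essentially their relevance order.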


4 Experiment Results 


We created an experimental HPB blacklist formulation 
system. To evaluate the HPBs, we performed a battery 
of experiments using the DShield.org security firewall 
and IDS log repository. We examined a collection of 
more than 720 million log entries produced by DShield 
contributors from October to November 2007. Since our 
relevance measure is based on correlations between con- 
tributors, HPB production is not applicable to contribu- 
tors that have submitted very few reports (DShield has 
contributors that hand-select or sporadically contribute 
logs, providing very few alerts). We therefore exclude 
those contributors that we find effectively have no correlation
with the wider contributor pool or simply have too
few alerts to produce meaningful results. For this analy- 
sis, we found that we could compute correlation relation- 
ships for about 700 contributors, or 41% of the DShield 
contributor pool. 

To assess the performance of the HPB system, 
we compare its performance relative to the standard 
DShield-produced GWOL [17]. In addition, we compare 
our HPB performance to that of LWOLs, which we com- 
pute individually for all contributors in our comparison 
set. For the purpose of our comparative assessment, we 
fixed the length of all three competing blacklists to ex- 
actly 1000 entries. However, after we present our com- 
parative performance results, we will then continue our 
investigation by analyzing how the blacklist length af- 
fects the performance of the HPBs. 

In the experiments, we generate GWOL, LWOL, and 
HPBs using data for a certain time period and then test 
the blacklists on data from the time window following 
this period. We call the period used for producing black- 
lists the training window and the period for testing the 
prediction window. In practice, the training period repre- 
sents a snapshot of the most recent history of the repos- 
itory, used to formulate each blacklist for a contributor 
that is then expected to use the blacklist for the length of 
the prediction window. The sizes of these two windows 
are not necessarily equal. We will first describe experi- 
ments that use 5-day lengths for both the training window 
and the prediction window. We then present experiments 
that investigate the effects of the two windows’ lengths 
on HPB quality. 


4.1 Hit Count Improvement 


DShield logs submitted during the prediction window 
are used to determine how many sources included within 
a contributor’s HPB are indeed encountered during that 
prediction window. We call this value the blacklist hit 
count. We view each blacklist address filter that is never
encountered by the blacklist consumer as an opportunity cost:
it takes the place of another filter that could
have otherwise blocked unwanted traffic. In this sense,
we view our hit count metric as an important measure 
of the effectiveness of a blacklist formulation algorithm. 
Note that our HPBs are formulated with severity analy- 
sis while the other lists are not. As the severity analysis 
prefers malicious activities, we expect that the hits on the 
HPBs are more malicious. 
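
The hit count itself is a simple set intersection. The following Python sketch shows the computation for one contributor; the /24 prefixes are documentation addresses chosen for the example, not data from the experiments.

```python
def hit_count(blacklist, prediction_window_sources):
    """Number of blacklist entries that the contributor actually observes
    as attack sources during the prediction window."""
    return len(set(blacklist) & set(prediction_window_sources))

# Hypothetical /24 prefixes on a contributor's blacklist ...
blacklist = ["192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"]
# ... and the prefixes this contributor reported in the prediction window.
observed = ["198.51.100.0/24", "203.0.113.0/24", "10.1.2.0/24"]

print(hit_count(blacklist, observed))
```

Entries that never appear in the prediction window (here the first prefix) contribute nothing to the count, which is why they are treated as wasted filter slots.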

To compare the three types of lists, we take 60 days of
data, divided into twelve 5-day windows. We repeat the
experiment 11 times using the i-th window as the training
window and the (i+1)-th window as the testing window.
In the training window, we construct HPB, LWOL, and
GWOL. Then the three types of lists are tested on the
data in the testing window.

Window  | GWOL total hit | LWOL total hit | HPB total hit | HPB/GWOL    | HPB/LWOL
1       | 81937          | 85141          | 112009        | 1.36701     | 1.31557
2       | 83899          | 74206          | 115296        | 1.37422     | 1.55373
3       | 87098          | 96411          | 122256        | 1.40366     | 1.26807
4       | 80849          | 75127          | 115715        | 1.43125     | 1.54026
5       | 87271          | 88661          | 118078        | 1.353       | 1.33179
6       | 93488          | 73879          | 122041        | 1.30542     | 1.6519
7       | 100209         | 105374         | 133421        | 1.33143     | 1.26617
8       | 96541          | 91289          | 126436        | 1.30966     | 1.38501
9       | 94441          | 107717         | 128297        | 1.35849     | 1.19106
10      | 96702          | 94813          | 128753        | 1.33144     | 1.35797
11      | 97229          | 108137         | 131777        | 1.35533     | 1.21861
Average | 90879 ± 6851   | 90978 ± 13002  | 123098 ± 7193 | 1.36 ± 0.04 | 1.37 ± 0.15

Table 3: Hit Number Comparison between HPB, LWOL and GWOL

                  | Contributor Percentage | Average Increase | Median Increase | StdDev | Increase Range
Improved vs. GWOL | 90%                    | 51               | 22              | 89     | 1 to 732
Poor vs. GWOL     | 7%                     | -27              | -7              | 47     | -1 to -206
Improved vs. LWOL | 95%                    | 75               | 36              | 90     | 1 to 491
Poor vs. LWOL     | 4%                     | -19              | -9              | 28     | -1 to -104

Table 4: Hit Count Performance, HPB vs. (GWOL and LWOL), Length 1000 Entries


Table 3 shows the total number of hits summed over 
the contributors for HPB, GWOL, and LWOL, respec- 
tively. It also shows the ratio of HPB hits over that of 
GWOL and LWOL. We see that in every window, HPB 
has more hits than GWOL and LWOL. Averaged over the
eleven windows, HPBs predict about 36% more hits than
GWOL and about 37% more than LWOL. Note
that the number of hits varies considerably between
time windows. Most of this variance, however,
does not come from our blacklist construction; rather, it
comes from the variation in the number of attackers that the
networks experience in different testing windows.


























         | Increase Average | Increase Median | Increase StdDev | Increase Range
vs. GWOL | 129              | 78              | 124             | 40 to 732
vs. LWOL | 183              | 188             | 93              | 59 to 491

Table 5: Top 200 Contributors’ Hit Count Increases
(Blacklist Length 1000)


The results in Table 3 show HPB’s hit improvement 
over time windows. We now investigate the distribution 
of the HPB’s hit improvement across contributors in one 
time window. We use two quantities for comparison. The 
first is the hit count improvement (IV), which is simply the
HPB hit count minus the hit count of the other list. The
second comparative measure we used is the relative hit
count improvement (RI), which is the ratio in percentage
of the HPB hit count increase over the other blacklist hit
count. If the other list hit count is zero we define RI to be
100 × the HPB hit count, and if both hit counts are zero
we set RI to 100.
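
The two comparison quantities, including the special cases for a zero hit count, can be written down directly. This Python sketch follows the definitions in the paragraph above; the function names are illustrative.

```python
def hit_improvement(hpb_hits, other_hits):
    """Hit count improvement: HPB hit count minus the other list's hit count."""
    return hpb_hits - other_hits

def relative_improvement(hpb_hits, other_hits):
    """Relative hit count improvement (RI), in percent.

    Special cases follow the text: if the other list has no hits, RI is
    100 times the HPB hit count; if both lists have no hits, RI is 100.
    """
    if other_hits == 0:
        if hpb_hits == 0:
            return 100.0
        return 100.0 * hpb_hits
    return 100.0 * (hpb_hits - other_hits) / other_hits

print(hit_improvement(150, 100), relative_improvement(150, 100))
```

An RI of 100 corresponds to the HPB doubling the other list's hit count, and a negative RI marks a contributor for whom the HPB performed worse.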
Table 5 provides a summary of hit-count improvement
for the 200 contributors where HPBs perform the best. 
The hit-count results for all the contributors are summa- 
rized in Table 4. 


Figure 7 compares HPB to GWOL. The left panel of
the figure plots the histogram showing the distribution of
the hit improvement across the contributors. The x-axis
indicates improvements, and the height of the bars represents
the number of contributors whose improvement falls
in the corresponding bin. Bars to the left of x = 0 represent
contributors for whom the HPB has worse performance
and bars on the right represent contributors for whom
HPBs performed better. For most contributors, the improvement
is positive. The largest improvement reaches
732. For only a few contributors, HPB performs worse
in this time window.


The panel on the right of Figure 7 plots the RI (ratio %
of HPB’s hit count increase over GWOL’s hit count) distribution.
We sort the RI values and plot them against the
contributors. We label the x-axis by cumulative percentage,
i.e., a tick on the x-axis represents the percentage
of contributors that lie to the left of the tick. For example,
the tick 20 means 20 percent of the contributors lie
to the left of this tick. There are contributors for which the RI
value can be more than 3900. Instead of showing such
large RI values, we cut off the plot at RI value 300. From
the plot, we see that there are about 20% of contributors
for which the HPBs achieve an RI of more than 100, i.e.,
the HPB at least doubled the GWOL hit count. For about
half of the contributors, the HPBs have about 25% more
hits (an RI of 25). The HPBs have more hits than GWOL
for almost 90% of the contributors. Only for a few contributors
(about 7%) do HPBs perform worse. (We discuss
the reasons why HPB may perform worse in Section 4.4.)

Figure 7: Hit Count Comparison of HPB and GWOL: Length 1000 Entries

Figure 8: Hit Count Comparison of HPB and LWOL: Length 1000 Entries

Figure 8 compares HPB hit counts to those of LWOL. 
The data are plotted in the same way as in Figure 7. 
Overall, HPBs demonstrate a performance advantage 
over LWOL. The IV and RI values also exhibit similar 
distributions. However, comparing Figures 7 and 8, we
see that HPB shows a larger hit improvement over
LWOL than over GWOL in this time window.


4.2 Prediction of New Attacks 


One clear motivating assumption in secure collaborative 
defense strategies is that participants have the potential 
to prepare themselves for attacks that they have not yet
encountered. We will say that a new attack occurs when 
a contributor produces a DShield log entry from a source 
that this contributor has never before reported. In this ex- 
periment, we show that HPB analysis provides contribu- 
tors a potential to predict more new attacks than GWOL. 
(LWOL is not considered, since by definition it includes 
only attackers that are actively hitting the LWOL owner.) 
For each contributor, we construct two new HPB and 
GWOL lists with equal length of 1000 entries, such that 
no entries have been reported by the contributor during 
our training window. We call these lists HPB-local (HPB 
minus local) and GWOL-local (GWOL minus local), respectively.
Figure 9 compares HPB-local and GWOL-local
on their ability to predict new attack sources
for the local contributor. These hit number plots demonstrate
that HPB-local provides substantial improvement
over the predictive value of GWOL-local.
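
One plausible way to build such a "minus local" list is sketched below in Python. The text only states that the lists have 1000 entries and contain no source reported by the contributor during the training window; walking down the original ranking and back-filling with the next-ranked sources is an assumption of this example, as are the function and variable names.

```python
def minus_local(ranked_sources, locally_reported, length=1000):
    """Build an HPB-local or GWOL-local style list.

    Walk the ranking in order, skip every source the contributor itself
    reported during the training window, and keep the first `length`
    remaining entries.
    """
    local = set(locally_reported)
    result = []
    for source in ranked_sources:
        if source in local:
            continue
        result.append(source)
        if len(result) == length:
            break
    return result

# Toy ranking of five sources; the contributor already reported "b" and "d".
ranking = ["a", "b", "c", "d", "e"]
print(minus_local(ranking, {"b", "d"}, length=2))
```

Any hit on such a list during the prediction window is, by construction, a source the contributor had not reported during the training window.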


4.3 Timely Inclusion of Sources 


By timely inclusion, we refer to the ability of a blacklist 
to incorporate addresses relevant to the blacklist owner 
before those addresses have saturated the Internet. To in- 
vestigate the timeliness of the GWOL, LWOL, and the 
HPB we examine how many contributors need to report 
a particular attacker before it can be included into the re- 
spective blacklists. We focus our attention on the set of 
attackers within these blacklists that did carry out attacks 
during the prediction window. We use the number
of distinct victims (contributors) that a source attacked 
in the training window to measure the extent to which 
the source has saturated the Internet. Figure 10 plots 
the distribution of the number of distinct victims across 
different attackers on the three blacklists. As expected, 
the attackers that get selected on the GWOL were the 
most prolific in the training period. In particular, all the 
sources on the GWOL have attacked more than 20 con- 
tributors and almost 1/3 of them attacked more than 200 
contributors. To some extent, these attackers have saturated
the Internet with their activities. (DShield sensors
are a very small sample of the Internet. A random attacker
has to target many places to be picked up by the
sensors.) The LWOLs select attacker addresses that focused
on the local networks. Most of these addresses
had attacked far fewer contributors. The HPBs’ distribution
is close to that of the LWOL, hence allowing the incorporation
of attackers that have not saturated the Internet.

Figure 9: HPB-local Predicts More New Attacks Than GWOL-local

Figure 10: Cumulative Distribution of Distinct Victim
Numbers


4.4 Performance Consistency 


The results in the above experiments show that the HPB 
provides an increase in hit count performance across the 
majority of all contributors. We now ask the follow- 
ing question: is the HPB’s performance consistent for 
a given contributor over time? In this experiment, we 
investigate this consistency question. 

We use a 60-day DShield dataset. We divide it into
12 time windows, $T_0, T_1, \ldots, T_{11}$. We generate blacklists
from data in time window $T_{i-1}$ and test the lists on
data in $T_i$. For each contributor v, we compare HPB with
GWOL and obtain eleven improvement values for windows
$T_0$ to $T_{10}$. We denote them
$IVs(v) = \{IV_0(v), IV_1(v), \ldots, IV_{10}(v)\}$. We then define
a consistency index (CI) for each contributor. If
$IV_i(v) > 0$, we say that the HPB performs well for v
in window i. Otherwise, we say that the HPB performs
worse. CI is the difference between the number of windows
in which HPB performs well and the ones in which
HPB performs poorly, i.e.,
$CI(v) = |\{p \in IVs(v) : p > 0\}| - |\{p \in IVs(v) : p < 0\}|$.
If HPB consistently performs better than GWOL for a contributor,
its CI(v) should be close to 11. If it consistently performs
worse, the CI value will be close to -11. However,
if the HPB performance flip-flops, its CI value will
be close to zero. Figure 11 plots the sorted CI values
against the contributors. (Again, we label the x-axis by
cumulative percentage.) We see that for almost 70% of
the contributors, HPB’s performance is extremely consistent.
They all have a CI value of 11, meaning for the
eleven time windows, the HPB always predicts more hits
for them than GWOL. For more than 90% of the contributors,
HPBs demonstrate fairly good consistency. For only a
few contributors does the performance switch back and
forth. Only 5 contributors show a consistency index below
-3.
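
The consistency index is easy to compute from a contributor's eleven improvement values. The Python sketch below follows the definition above (windows with positive improvement minus windows with negative improvement); the sample value lists are invented for illustration.

```python
def consistency_index(improvement_values):
    """CI(v): number of windows with a positive improvement value minus
    the number of windows with a negative one."""
    wins = sum(1 for p in improvement_values if p > 0)
    losses = sum(1 for p in improvement_values if p < 0)
    return wins - losses

# Invented improvement values for three hypothetical contributors.
always_better = [12, 30, 7, 5, 41, 9, 18, 3, 25, 14, 6]
flip_flop = [4, -3, 6, -2, 5, -7, 1, -1, 8, -4, 2]
always_worse = [-5] * 11

print(consistency_index(always_better),
      consistency_index(flip_flop),
      consistency_index(always_worse))
```

A contributor whose HPB wins in every window gets CI = 11, one whose results alternate gets a value near zero, and one whose HPB always loses gets CI = -11.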
Figure 11: Cumulative Distribution of Consistency Index


The consistency investigation sheds some light on the 
reason why there is a small percentage of contributors 
for which the HPBs (sometimes) perform worse than the 
other list. HPB construction is based on the relevance 
measure. The relevance relates attack sources to contrib- 
utors according to the past security logs collected by the 
repository. If a contributor has relatively stable correla- 
tions (stable for several days) with other contributors or it 
experiences stable attack patterns, the relevance measure 
can capture this and thus produce blacklists with more 
hits. Such HPBs will also be consistent in hit-count performance.
On the other hand, if the correlation is not
stable or the attacks exhibit few patterns, the relevance 
measure will be less effective and may produce black- 
lists with fewer hits. Such HPBs will not be consistent 
in performance because sometimes they may guess right 
and produce more hits and sometimes they may guess 
wrong. 

This can be seen in Figure 11. All the consistent HPBs 
have CI value 11. These HPBs have both consistency 
and better hit-count performance. There is no HPB that 
shows CI value -11. HPB never performs consistently 
worse. 

This is particularly useful because the consistency of 
an HPB’s performance can be used to indicate whether 
the HPB user (the contributor) has stable correlations. If 
so, HPBs can be better blacklists to use. The experiment 
result suggests that most of the contributors have stable 
correlations. In practice, given a few cycles of computing 
HPB and GWOL for a DShield contributor, we can pro- 
vide an informed recommendation as to which list that 
contributor should adopt over a longer term. 


4.5 Blacklist Length 


In this experiment, we vary the length of the blacklists to 
be 500, 1000, 5000 and 10000. We then compare the hit 
counts of HPBs, GWOLs, and LWOLs. Because in all 
the experiments, the improvements for different contrib- 
utors display similar distributions, we will simply plot 
the medians of the hit rates of these respective blacklists. 
(Hit rate is the hit count divided by the blacklist length.) 
Our results are illustrated in Figure 12, and show that 
HPBs have the hit rate advantage for all these choices in 
blacklist length. The relative amount of advantage is also 
maintained across different lengths. 
Figure 12: Hit Rates of HPB, GWOL, and LWOL with
Different Lengths


Although the hit rate for the shorter lists is higher, the
number of hits is larger for the longer lists. This is so
for all three types of blacklists. It shows that the longer
the list is, the more entries on the list are wasted (in the
sense that they do not get hit). Therefore, it may not
always be desirable to use very long lists.


4.6 Training and Prediction Window Sizes 


We now investigate how far into the future the HPB can 
maintain its advantage over GWOL and LWOL, and how 
different training window sizes affect an HPB’s hit count. 
The former helps to determine how often we need to re- 
compute the blacklist, and the latter helps to select the 
right amount of history data as the input to our system. 
The left panel of Figure 13 shows the median of the hit
count of HPB, GWOL, and LWOL for each individual day
(day 1, 2, 3, ..., 20) in the prediction window. All
lists are generated using data from a 5-day window prior
to the prediction window. For all blacklists, the number
of hits decreases over time. The HPB maintains an
advantage over the entire duration of the prediction window.
From this plot, we also see that the blacklists need
to be refreshed frequently. In particular, there may be an
almost 30% hit drop when the HPB is more than a week
old.

The right panel of Figure 13 plots hit-number medians 
for four HPBs. These HPBs are generated in a slightly 
different way from the HPBs we used so far. In previ- 
ous experiments, to generate an HPB, we produce the 
correlation matrix from a set of attack reports. Then the 
sources in the same set of reports are selected into HPBs 
based on their relevance. In this experiment, we con- 
struct the correlation matrix using reports from training 
windows of size 2, 5, 7, and 10 days. Then the sources 
that are in the reports within the 5-day window right be- 
fore the prediction (test) window are picked based on 
their relevance. In this formulation, we exclude sources 
that appear only in reports from distant history; we view 
their extended silence to represent a significant loss in 
relevance. The remainder of the test is performed in the 
same way as the previous experiments, i.e., the hit counts 
are obtained in the following 5-day prediction window. 
The experiment result shows that there is a slight increase
in the hit counts going from a 2-day training window
to a 5-day training window. The hit counts then
remain roughly the same for the other training-window
sizes. This indicates that for most of the contributors, the
correlation matrix can be quite stable over time.


Figure 13: Effect of Training Window and Prediction Window Size on HPB’s hit count


5 An Example Blacklisting Service


In mid 2007, we deployed an initial prototype implementation
of the HPB system, providing a subset of the
features described in this paper. This initial deployment
was packaged as a free Internet blacklisting service
for DShield log contributors [22,23]. HPB blacklists
are constructed for all contributors daily, and each contributor
can download her individual HPB through her
DShield website account. To date, we have had a relatively
small pool of HPB downloaders (roughly 70 users
over the most recent 3 months). We now describe several
aspects of fielding a practical and scalable implementation
of an HPB system based on our initial deployment experiences.
We present an assessment of the algorithm complexity
and the DShield service implementation, and discuss
some open questions raised from the open release of our
service.


5.1 Algorithm Complexity 


Because HPBs are constructed from a relatively high-
volume corpus of security logs, our system must be pre-
pared to process well over 100M log entries per day in
order to cover the current 5-day training window. The
bottleneck of the system is the relevance ranking.
Therefore, our complexity discussion focuses on the
ranking algorithm. Some amount of complexity is always
linear in the size of the alert data. That is,
let N(data) be the number of alerts in the data collec-
tion; we have a minimum complexity of O(N(data)).
Our discussion will focus on the other costs incurred
by the algorithm beyond this linear-time requirement.

We denote by N(s) and N(v) the number of sources 
in the data collection and the number of contributors to 
the repository respectively. In practice, one can expect 
N(v) to be in the order of thousands while N(s) is much 
larger, typically in the tens of millions. We obtain W 
and b^s by going through the repository and doing simple
accounting. The adjacency matrix W requires the most
work to construct. To obtain this matrix, we record ev-
ery overlapped attack while going through the alert data
and then perform standardization. The latter step re-
quires us to go through the whole matrix, which results in
O(N(v)^2) complexity.

Besides going through the data, the most time-
consuming step in the relevance estimation process is the
computation that solves the linear equations in Equa-
tion 3. At first glance, because each source s has its own
linear system determined by Equation 3, it seems
that we need to solve N(s) linear systems. This can
be expensive as N(s) is very large. Further investi-
gation shows that while b^s is different for each source s,
the (I - W)^{-1} part of the solution to Equation 3 is
the same for all s. Therefore, we need to compute it
only once, which requires O(N(v)^3) time by brute force
or O(N(v)^2.376) using more sophisticated methods [5].
Because b^s is sparse, once we have (I - W)^{-1}, the to-
tal time to obtain the ranking scores for all the sources
and all the contributors is O(N(v) · N(data)). Assum-
ing N(v)^2 is much smaller than N(data), the total com-
plexity of relevance ranking is O(N(v) · N(data)).
For a data set that contains a billion records contributed 
by a thousand sensors, generating a thousand rankings 
requires only several trillion operations (additions and 
multiplications). This can be easily handled by modern 
computers. In fact, in our experiments, with N(data) 
in the high tens of millions and N(v) on the order of 
one thousand, it takes less than 30 minutes to generate 
all contributor blacklists on an Intel Xeon 3.6 GHz ma- 
chine. 
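
The saving from computing (I - W)^{-1} only once can be illustrated with a small sketch. It assumes, purely for illustration, that the scores for a source s are obtained as (I - W)^{-1} b^s (Equation 3 is not restated here), uses a made-up three-contributor matrix W, and stores each b^s as a sparse dictionary. The names invert and relevance_scores are hypothetical and are not part of the deployed system.

```python
from fractions import Fraction as F

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination (the O(n^3) brute-force route)."""
    n = len(matrix)
    # Augment each row with the matching row of the identity matrix.
    aug = [list(row) + [F(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Pick a row with a nonzero entry in this column and move it into place.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def relevance_scores(inverse, sparse_b):
    """Multiply the shared inverse by one sparse b^s; cost is N(v) times the nonzeros of b^s."""
    return [sum(row[j] * value for j, value in sparse_b.items()) for row in inverse]

# Hypothetical 3-contributor correlation matrix W (values chosen for illustration).
W = [[F(0), F(1, 4), F(0)],
     [F(1, 4), F(0), F(1, 2)],
     [F(0), F(1, 2), F(0)]]
I_MINUS_W = [[F(int(i == j)) - W[i][j] for j in range(3)] for i in range(3)]
INVERSE = invert(I_MINUS_W)  # computed once, shared by every source

# Sparse b^s vectors: contributor index -> reported value, one dictionary per source.
reports = {"source_a": {0: F(1)}, "source_b": {1: F(1), 2: F(1)}}
scores = {s: relevance_scores(INVERSE, b) for s, b in reports.items()}
```

Because each b^s has only a few nonzero entries, the per-source work is a handful of column lookups in the shared inverse, which is where the O(N(v) · N(data)) total comes from.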


5.2 The DShield Implementation


The pragmatics of deploying an HPB service through the 
DShield website are straightforward. DShield log con- 
tributors are already provided private web accounts in 
order to review their reports. However, to ease the auto- 
matic retrieval of HPBs, users are not required to log in 
via DShield’s standard web account procedure. Instead, 
contributors wishing to access their individual HPBs can 
create account-specific hexadecimal tokens, and can then 
append this token to the HPB URL. This token has a 
number of advantages, particularly for developing and 
maintaining automated HPB retrieval scripts. That is, a 
user account password may be changed regularly, but the 
retrieval token (and script) will remain unaffected. 

To provide further protection of the integrity and con-
fidentiality of an HPB, the user may also pull the HPB via
https. A detached PGP signature can be retrieved in case
https is not available or not considered a sufficient proof
of authenticity.

    # DShield Customized Blocklist
    # created 2007-01-19 12:13:14 UTC
    # for userid 11111
    # some rights reserved, DShield Inc., Creative Commons Share Alike License
    # License and Usage Info: http://www.dshield.org/blocklist.html
    [address illegible]    255.255.255.0    test network
    [address illegible]    255.255.255.0    another test. This network does not exist
    # End of list

Figure 14: A Sample Blocklist from DShield Implementation

HPBs are distributed using a simple tab-delimited for- 
mat. The first column identifies the network address. 
The second column provides the netmask. Additional 
columns are used to provide more information about the 
respective offender, such as the name of the network and 
country of origin (or type of attacks seen). These ad- 
ditional columns are intended for human review of the 
HPB. Comments may be added to the blocklist. All com- 
ments start with a # mark. A sample blocklist is shown 
in Figure 14. 
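
A consumer script might read this format along the following lines. This is a minimal sketch under stated assumptions: the function name parse_blocklist is hypothetical, the sample entries use reserved documentation address ranges rather than real blocklist data, and the extra columns are simply passed through as strings.

```python
import ipaddress

def parse_blocklist(text):
    """Return (network, extra_columns) pairs from a tab-delimited blocklist."""
    entries = []
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        address, netmask = fields[0].strip(), fields[1].strip()
        # ip_network accepts "address/netmask" and raises ValueError on bad input.
        network = ipaddress.ip_network(f"{address}/{netmask}")
        entries.append((network, [f.strip() for f in fields[2:]]))
    return entries

# Hypothetical sample in the described format (documentation address ranges).
SAMPLE = (
    "# DShield Customized Blocklist\n"
    "192.0.2.0\t255.255.255.0\ttest network\n"
    "198.51.100.0\t255.255.255.0\tanother test\tXX\n"
    "# End of list\n"
)
entries = parse_blocklist(SAMPLE)
```

Each parsed network can then be tested against an incoming address with the `in` operator before the entry is handed to a firewall rule generator.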


5.3 Gaming the System


As we have made efforts to implement, test, and adver- 
tise early versions of the HPB system, several open ques- 
tions have been raised regarding the ability of adversaries 
to game the HPB system. That is, can an attacker con-
tribute data to DShield with the intention of manipulating
HPB production in ways that harm HPB qual-
ity? Let us consider several questions that arise from the
fact that HPBs are derived from volunteer sources, which
may include dishonest contributors that are actively try-
ing to harm or manipulate HPB results.

Can an attacker cause a consumer to incorporate an 
unsuspecting victim address into a third party’s HPB? 
Let us assume that attacker A participates as one or more 
DShield contributors (A might register multiple IDs) and 
knows that consumer C is also a DShield contributor and 
an active HPB user. Furthermore, A would like to cause 
address B to be inserted into consumer C’s HPB. There 
are two potential strategies A can pursue to achieve this 
goal. First, A can spoof attacks as address B, directing
these attacks to other contributors that are highly corre-
lated with C. However, C’s correlated contributor set is
neither readily available to A (unless A is a DShield in-
sider) nor necessarily stable over time. More plausibly, A
could artificially cause his own contributor IDs to report
the same attacks as C. He can do this by attacking C with
a set of spoofed addresses, and then reporting similarly
spoofed logs from his contributor IDs. Once a sufficient
set of attack logs with identical spoofed attackers is re-
ported by C and A, A could then positively influence the
likelihood that address B will be inserted into C’s HPB.
While this is a possible threat, we also observe that simi- 
lar attacks can be launched against GWOL and more triv- 
ially against LWOL. Furthermore, in the case of GWOL, 
B will be inserted in all consumers’ GWOLs, whereas 
A must launch this attack individually against each HPB 
consumer. 

Can an attacker cause his own address to be excluded 
from a specific third-party HPB? Let us assume that A 
would like to guarantee that address B will not appear in 
C’s HPB. This is very difficult for A to guarantee. While 
A may cause artificial alignment between his and C’s 
logs using the alert spoofing method discussed above, A 
cannot control what other addresses may also align with 
C. If B attacks other contributors that are aligned with 
C, B has the potential to enter C’s HPB. 

Can an attacker fully prevent or poison all HPB pro- 
duction? In short, yes. Data poisoning is a fundamental 
threat that arises in all volunteer contributor-based data 
centers, and is an inherently difficult threat to overcome. 
However, DShield does occasionally experience issues
such as accidental flooding and sensor misconfiguration,
and it incorporates countermeasures for them. DDoS threats also
arise and are dealt with by DShield case by case.

HPB generation could also be specifically targeted by 
a malicious contributor that attempts to artificially inflate 
the number of attacker or victim addresses, which will in-
crease the values of N(s) or N(v), as described in our complexity
analysis in Section 5.1. However, to sufficiently prohibit
HPB production, the contributor would necessarily pro- 
duce highly anomalous volumes of attackers (or sources) 
that would likely allow us to identify and (temporarily) 
filter this contributor. 


6 Conclusion 


In this paper, we introduced a new system to generate 
blacklists for contributors to a large-scale security-log 
sharing infrastructure. The system employs a link anal- 
ysis method similar to Google’s PageRank for black- 
list formulation. It also integrates substantive log pre-
filtering and a severity metric that captures the degree to
which an attacker’s alert patterns match those of com- 
mon malware-propagation behavior. Experimenting on 
a large corpus of real DShield data, we demonstrate that 
our blacklists have higher attacker hit rates, better new 
attacker prediction quality, and long-term performance 
stability. 

In April of 2007, we released a highly predictive 
blacklist service at DShield.org. We view this service 
as a first experimental step toward a new direction of 
high-quality blacklist generation. We also believe that 
this service offers a new argument to help motivate the 
field of secure collaborative data sharing. In particular, 
it demonstrates that people who collaborate in blacklist 
formulation can share a greater understanding of attack 
source histories, and thereby derive more informed filter- 
ing policies. As future work, we will continue to evolve 
the HPB blacklisting system as our experience grows 
through managing the blacklist service. 
Abstract— Large-scale bandwidth-based distributed
denial-of-service (DDoS) attacks can quickly knock out 
substantial parts of a network before reactive defenses 
can respond. Even traffic flows that are not under direct 
attack can suffer significant collateral damage if these 
flows pass through links that are common to attack 
routes. Given the existence today of large botnets with 
more than a hundred thousand bots, the potential for 
a large-scale coordinated attack exists, especially given 
the prevalence of high-speed Internet access. This paper 
presents a Proactive Surge Protection (PSP) mechanism 
that aims to provide a broad first line of defense against 
DDoS attacks. The approach aims to minimize collateral 
damage by providing bandwidth isolation between traffic 
flows. This isolation is achieved through a combination 
of traffic measurements, bandwidth allocation of network 
resources, metering and tagging of packets at the network 
perimeter, and preferential dropping of packets inside 
the network. The proposed solution is readily deployable 
using existing router mechanisms and does not rely on 
any unauthenticated packet header information. Thus 
the approach is resilient to evading attack schemes that 
launch many seemingly legitimate TCP connections with 
spoofed IP addresses and port numbers. Finally, our 
extensive evaluation results across two large commercial 
backbone networks, using both distributed and targeted 
attack scenarios, show that up to 95.5% of the network 
could suffer collateral damage without protection, but 
our solution was able to significantly reduce the amount 
of collateral damage by up to 97.58% in terms of the 
number of packets dropped and 90.36% in terms of the 
number of flows with packet loss. Furthermore, we show 
that PSP can maintain low packet loss rates even when 
the intensity of attacks is increased significantly. 


I. INTRODUCTION 


A coordinated attack can potentially disable a network 
by flooding it with traffic. Such attacks are also known 
as bandwidth-based distributed denial-of-service (DDoS) 
attacks and are the focus of our work. Depending on 
the operator, the provider network may be a small-to- 
medium regional network or a large core network. For 
small-to-medium size regional networks, this type of
bandwidth-based attack has certainly disrupted service
in the past. For core networks with huge capacities, one 
might argue that such an attack risk is remote. However, 
as reported in the media [6], large botnets already exist 
in the Internet today. These large botnets combined with 
the prevalence of high speed Internet access can quite 
easily give attackers multiple tens of Gb/s of attack 
capacity. Moreover, core networks are oversubscribed.
For example, in the Abilene network [1], some of the
core routers have an incoming capacity of more than
30 Gb/s from the access networks, but only 20 Gb/s
of outgoing capacity to the core. Although commercial
ISPs do not publish their oversubscription levels, they 
are generally substantially higher than the ones found 
in the Abilene network due to commercial pressures of 
maximizing return on investments. 

Considering these insights, one might wonder why
we have not seen multiple successful bandwidth-based
attacks on large core networks in the past. The answer
to this question is difficult to assess. In part, attacks
might not be occurring because the organizations that
control the botnets are interested in making money by
distributing spam, committing click fraud, or extorting
money from mid-sized websites. Therefore, they would
have no commercial interest in disrupting the Internet as 
a whole. Another reason might be that network operators 
are closely monitoring their traffic and actively trying 
to intervene. Nonetheless, recent history has shown that 
if such an attack possibility exists, it will eventually 
be exploited. For example, SYN flooding attacks were 
described in [3] years before such attacks were used to 
disrupt servers in the Internet. 

To defend against large bandwidth-based DDoS at- 
tacks, a number of defense mechanisms currently exist, 
but many are reactive in nature (i.e., they can only 
respond after an attack has been identified in an effort 
to limit the damage). However, the onset of large- 
scale bandwidth-based attacks can occur almost instan- 
taneously, causing potentially a huge surge in traffic that 
can effectively knock out substantial parts of a network 
before reactive defense mechanisms have a chance to 
respond. To provide a broad first line of defense against 
DDoS attacks when they happen, we propose a new 
protection mechanism called Proactive Surge Protection 
(PSP). In particular, under a flooding attack, traffic 
loads along attack routes will exceed link capacities, 
causing packets to be dropped indiscriminately. Without 
proactive protection, even for traffic flows that are not 
under direct attack, substantial packet loss will occur if 
these flows pass through links that are common to attack 
routes, resulting in significant collateral damage. The 
PSP solution is based on providing bandwidth isolation 
between traffic flows so that the collateral damage to
traffic flows not under direct attack is substantially 
reduced. 

This bandwidth isolation is achieved through a com- 
bination of traffic data collection, bandwidth allocation 
of network capacity based on traffic measurements, me- 
tering and tagging of packets at the network perimeter 
into two differentiated priority classes based on capacity 
allocation, and preferential dropping of packets in the 
network when link capacities are exceeded. It is im- 
portant to note that PSP has no impact on the regular 
operation of the network if no link is overloaded. It 
therefore introduces no penalty in the common case. 
In addition, PSP is deployable using existing router 
mechanisms that are already available in modern routers, 
which makes our approach scalable, feasible, and cost 
effective. Further, PSP is resilient to IP spoofing as well 
as changes in the underlying traffic characteristics such 
as the number of TCP connections. This is due to the 
fact that we focus on protecting traffic between different 
ingress-egress interface pairs in a provider network and 
both the ingress and egress interface of an IP datagram 
can be directly determined by the network operator. 
Therefore, the network operator does not have to rely 
on unauthenticated information such as a source or 
destination IP address to tag a packet. 

The work presented in this paper substantially extends 
a preliminary version of our work that was initially 
presented at a workshop [10]. In particular, we propose 
a new bandwidth allocation algorithm called CDF-PSP 
that takes into consideration the traffic variability ob- 
served in historical traffic measurements. CDF-PSP aims 
to maximize in a max-min fair manner the acceptance 
probability (or equivalently the min-max minimization 
of the drop probability) of packets by using the cu- 
mulative distribution function over historical data sets 
as the objective function. By taking into consideration 
the traffic variability, we show that the effectiveness of 
our protection mechanism can be significantly improved. 
In addition, we have also substantially extended our 
preliminary work with much more extensive in-depth 
evaluation of our proposed PSP mechanism using de- 
tailed trace-driven simulations. 

To test the robustness of our proposed approach, 
we evaluated the PSP mechanism using both highly 
distributed attack scenarios involving a high percentage 
of ingress and egress routers, as well as targeted attack 
scenarios in which the attacks are concentrated on a small
number of egress destinations. Our extensive evaluations 
across two large commercial backbone networks show 
that up to 95.5% of the network could suffer collateral 
damage without protection, and our solution was able to 
significantly reduce the amount of collateral damage by 
up to 97.58% in terms of the number of packets dropped 
and up to 90.36% in terms of the number of flows with
packet loss. 

In comparison to our preliminary work, our new
algorithm was able to achieve a
relative reduction of up to 53.09% in terms of the
number of packets dropped and up to 59.30% in terms 
of the number of flows with packet loss. In addition, we 
show that PSP can maintain low packet loss rates even 
when the intensity of attacks is increased significantly. 
Beyond evaluating extensively the impact of our protec- 
tion scheme on packet drops, we also present detailed 
analysis on the impact of our scheme at the level of flow 
aggregates between individual ingress-egress interface 
pairs in the network. 

The rest of this paper is organized as follows. Sec- 
tion II outlines related work. Section III presents a high- 
level overview of our proposed PSP approach. Section IV 
describes in greater detail the central component of our
proposed architecture that deals with bandwidth alloca- 
tion policies. Section V describes our experimental setup, 
and Section VI presents extensive evaluation of our 
proposed solutions across two large backbone networks. 
Section VII concludes the paper. 


II. RELATED WORK 


DDoS protection has received considerable attention
in the literature. The oldest approach, still heavily in
use today, is typically based on coarse-grained traffic
anomaly detection [21], [2]. Traceback techniques [32],
[27], [28] are then used to identify the true attack
source, which could be disguised by IP spoofing. Af-
ter detecting the true source of the DDoS traffic, the
network operator can block the DDoS traffic on its
ingress interfaces by configuring access control lists or
by using DDoS scrubbing devices such as [4]. Although
these approaches are practical, they do not allow for
instantaneous protection of the network. As implemented
today, these approaches require multiple minutes to
detect and mitigate DDoS attacks, which does not match
the time sensitivity of today’s applications. Similarly,
network management mechanisms that generally aim to 
find alternate routes around congested links also do not 
operate on a time scale that matches the time sensitivity 
of today’s applications. 

More recently, the research community has focused 
on enhancing the current Internet protocol and routing 
implementations. For example, multiple proposals have
suggested limiting the best-effort connectivity of the net-
work using techniques such as capability models [24],
[33], proof-of-work schemes [19], filtering schemes [20],
or default-off communication models [7]. The main
focus of these papers is the protection of customers 
connecting to the core network rather than protecting 
the core itself, which is the focus of our work. To
illustrate the difference, consider a scenario in which 
an attacker controls a large number of zombies. These 
zombies could communicate with each other, granting 
each other capabilities or similar rights to communicate. 
If planned properly, this traffic is still sufficient to attack 
a core network. The root of the problem is that the core 
cannot trust either the sender or the receiver of the traffic 
to protect itself. 

Several proactive solutions have been proposed. One 
solution was presented in [30]. Similar to the proposals 
limiting connectivity cited above, it focuses on protecting 
individual customers. This leads again to a trust issue in 
that a service provider should not trust its customers for 
protection. Furthermore, their solution relies heavily on
the operator and customers knowing a priori which
network entities are good and which are bad, and it
does not scale, because it requires
maintaining detailed per-customer state for all customers
within the network. Router-based defense mechanisms
have also been proposed as a way to mitigate bandwidth- 
based attacks. They generally operate either on traffic 
aggregates [17] or on individual flows [22]. However, 
as shown in [31], these router-based mechanisms can be 
defeated in several ways. Moreover, deploying router- 
based defense mechanisms like pushback at every router 
can be challenging. 

Our work applies the existing body of literature on
max-min fair resource allocation [8], [29], [16], [9], [25],
[26], [23] to the problem of proactive DDoS defense.
However, our work here is different in that we use
max-min fair allocation for the purpose of differential
tagging of packets with the objective of minimizing
collateral damage when a DDoS attack occurs. Our
work here is also different from the server-centric DDoS
defense mechanism proposed in [34], which is aimed
at protecting end-hosts rather than the network. In their 
solution, a server explicitly negotiates with selected 
upstream routers to throttle traffic destined to it. Max- 
min fairness is applied to set the throttling rates of these 
selected upstream routers. Like [30] discussed above, 
their solution also has a scalability issue in that the 
selected upstream routers must maintain per-customer 
state for the requested rate limits. 

Finally, our work also builds on existing preferential 
dropping mechanisms that have been developed for 
providing Quality-of-Service (QoS) [11], [13]. However, 
for providing QoS, the service-level-agreements that 
dictate the bandwidth allocation are assumed to be either 
specified by customers or decided by the operator for the 
purpose of traffic engineering. There is also a body of 
work on measurement-based admission control for deter- 
mining whether or not to admit new traffic into the net- 
work, e.g. [15], [18]. With both service-level-agreement-
based and admission-control-based bandwidth reserva-
tion schemes, rate limits are enforced. Our work here
is different in that we use preferential dropping for a
different purpose: to provide bandwidth isolation between
traffic flows to minimize the damage that attack traffic
can cause to regular traffic. Our solution is based on 
a combination of traffic measurements, fair bandwidth 
allocation, soft admission control at the network perime- 
ter, and lazy dropping of traffic inside the network only 
when needed. As the mechanisms of differential tagging 
and preferential dropping are already available in modern 
routers, our solution is readily deployable. 


III. PROACTIVE SURGE PROTECTION 


In this section, we present a high-level architectural 
overview of a DDoS defense solution called Proactive 
Surge Protection (PSP). To illustrate the basic concept, 
we will depict an example scenario for the Abilene 
network. That network consists of 11 core routers that 
are interconnected by OC192 (10 Gb/s) links. For the 
purpose of depiction, we will zoom in on a portion of 
the Abilene network, as shown in Figure 1(a). Consider 
a simple illustrative situation in which there is a sud- 
den bandwidth-based attack along the origin-destination 
(OD) pair Chicago/NY, where an OD pair is defined to 
be the corresponding pair of ingress and egress nodes. 
Suppose that the magnitude of the attack traffic is 10 
Gb/s. This attack traffic, when combined with the regular 
traffic for the OD pairs Sunnyvale/NY and Denver/NY 
(3 + 3 + 10 = 16 Gb/s), will significantly oversubscribe 
the 10 Gb/s Chicago/NY link, resulting in a high per- 
centage of indiscriminate packet drops. Although the 
OD pairs Sunnyvale/NY and Denver/NY are not under 
direct attack, these flows will also suffer substantial 
packet loss on links which they share with the attack 
OD pair, resulting in significant collateral damage. The 
flows between Sunnyvale/NY and Denver/NY are said 
to be caught in the crossfire of the Chicago/NY attack. 


A. PSP Approach 


The PSP approach is based on providing bandwidth 
isolation between different traffic flows so that the 
amount of collateral damage sustained along crossfire 
traffic flows is minimized. This bandwidth isolation is 
achieved by using a form of soft admission control 
at the perimeter of a provider network. In particular, 
to avoid saturation of network links, we impose rate 
limits on the amount of traffic that gets injected into 
the network for each OD pair. However, rather than 
imposing a hard rate limit, where packets are blocked 
from entering the network, we classify packets into two 
priority classes, high and low. Metering is performed 
at the perimeter of the network, and packets are tagged 
high if the arrival rate is below a certain threshold. But
when this threshold is exceeded, packets will get
tagged as low priority. Then, when a network link gets
saturated, e.g. when an attack occurs, packets tagged
with a low priority will be dropped preferentially. This
ensures that our solution does not drop traffic unless a
network link capacity has indeed been exceeded. Under
normal network conditions, in the absence of sustained
congestion, packets will get forwarded in the same
manner as without our solution.

Fig. 1. Attack scenario on the Abilene network.


Consider again the above example, now depicted 
in Figure 1(b). Suppose we set the high priority rate 
limit for the OD pairs Sunnyvale/NY, Denver/NY, and 
Chicago/NY to 3.5 Gb/s, 3.5 Gb/s, and 3 Gb/s, respec- 
tively. This will ensure that the total traffic admitted 
as high priority on the Chicago/NY link is limited to 
10 Gb/s. Operators can also set maximum rate limits 
to some factor below the link capacity to provide the 
desired headroom (e.g. set the target link load to be 
90%). If the limit set for a particular OD pair is above 
the actual amount of traffic along that flow, then all 
packets for that flow will get tagged as high priority. 
Consider the OD pair Chicago/NY. Suppose the actual 
traffic under an attack is 10 Gb/s, which is above the 3 
Gb/s limit. Then, only 3 Gb/s of traffic will get tagged 
as high priority, and 7 Gb/s will get tagged as low 
priority. Since the total demand on the Chicago/NY link
exceeds the 10 Gb/s link capacity, a considerable number of
packets would get dropped. However, the dropped packets will come
from the OD pair Chicago/NY since all packets from
Sunnyvale/NY and Denver/NY would have been tagged 
as high priority. Therefore, the packets for the OD pairs 
Sunnyvale/NY and Denver/NY would be shielded from 
collateral damage. 
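
The arithmetic of this example can be checked with a short sketch. It is an illustration under stated assumptions rather than the router mechanism itself: the helper names tag and drops_on_link are hypothetical, and drops within a priority class are assumed to be spread in proportion to each OD pair's load, which is one simple way to model indiscriminate dropping.

```python
def tag(offered, limit):
    """Split an OD pair's offered load (Gb/s) into (high, low) priority shares."""
    high = min(offered, limit)
    return high, offered - high

def drops_on_link(tagged, capacity):
    """Drop low priority first; within a class, drop in proportion to load (an assumption)."""
    total_high = sum(high for high, _ in tagged.values())
    total_low = sum(low for _, low in tagged.values())
    high_sent = min(total_high, capacity)
    low_sent = min(total_low, capacity - high_sent)
    high_loss = 1 - high_sent / total_high if total_high else 0.0
    low_loss = 1 - low_sent / total_low if total_low else 0.0
    return {od: high * high_loss + low * low_loss
            for od, (high, low) in tagged.items()}

# Offered loads and high priority rate limits from the example, in Gb/s.
offered = {"Sunnyvale/NY": 3.0, "Denver/NY": 3.0, "Chicago/NY": 10.0}
limits = {"Sunnyvale/NY": 3.5, "Denver/NY": 3.5, "Chicago/NY": 3.0}

with_psp = drops_on_link({od: tag(offered[od], limits[od]) for od in offered}, 10.0)
# Without PSP every packet is in one class, modeled here as a limit of zero.
without_psp = drops_on_link({od: tag(offered[od], 0.0) for od in offered}, 10.0)
```

Under these assumptions, the 6 Gb/s of excess traffic is dropped entirely from Chicago/NY when PSP tagging is used, whereas without tagging Sunnyvale/NY and Denver/NY each lose 1.125 Gb/s of their 3 Gb/s.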


Although our simple illustrative example shown in
Figure 1 only involved one attack flow from one ingress
point, the attack traffic in general can be highly dis- 
tributed. As we shall see in Section VI, the proposed 
PSP method is also quite effective in such distributed 
attack scenarios. 
Fig. 2. Proactive Surge Protection (PSP) architecture.


B. PSP Architecture 


Our proposed PSP architecture is depicted in Figure 2. 
The architecture is divided into a policy plane and an 
enforcement plane. The traffic data collection and band- 
width allocation components are on the policy plane, and 
the differential tagging and preferential drop components 
are on the enforcement plane. 


Traffic Data Collector: The role of the traffic data 
collection component is to collect and summarize his- 
torical traffic measurements. For example, the widely 
deployed Cisco sampled NetFlow mechanism can be 
used in conjunction with measurement methodologies
such as those outlined in [14] to collect and derive
traffic matrices for different times throughout a day, a
week, a month, etc., between different origin-destination
(OD) pairs of ingress-egress nodes. The infrastructure for 
this traffic data collection already exists in most service 
provider networks. The derived traffic matrices are used 
to estimate the range of expected traffic demands for 
different time periods. 


Bandwidth Allocator: Given the historical traffic data 
collected, the role of the bandwidth allocator is to deter- 
mine the rate limits at different time periods. For each 
time period t, the bandwidth allocator will determine a
bandwidth allocation matrix, B(t) = [b_{s,d}(t)], where
b_{s,d}(t) is the rate limit for the corresponding OD pair
with ingress node s and egress node d for a particular
time of day t. For example, a different bandwidth allo-
cation matrix B(t) may be computed for each hour in a
day using the historical traffic data collected for the same
hour of the day. Under normal operating conditions,
network links are typically underutilized. Therefore, traf- 
fic demands from historical measurements will reflect 
this underutilization. Since there is likely to be room 
for admitting more traffic into the high priority class 
than observed in the historical measurements, we can 
fully allocate in some fair manner the available network 
resources to high priority traffic. By fully allocating 
the available network resources beyond the previously 
observed traffic, we can provide headroom to account
for estimation inaccuracies and traffic burstiness. The 
bandwidth allocation matrices can be computed offline, 
and operators can remotely configure routers at the 
network perimeter with these matrices using existing 
router configuration mechanisms. 


Differentiated Tagging: Given the rate limits determined
by the bandwidth allocator, the role of the differentiated
tagging component is to perform the metering and
tagging of packets in accordance with the determined rate
limits. This component is implemented at the perimeter
of the network. In particular, packets arriving at ingress
node $s$ and destined to egress node $d$ are tagged as high
priority if their metered rates are below the threshold
given by $b_{s,d}(t)$, using the bandwidth allocation matrix
$B(t)$ for the corresponding time of day. Otherwise, they
are tagged as low priority. These traffic management
mechanisms for metering and tagging are commonly
available in modern routers at linespeeds.


Preferential Drops: With packets tagged at the perime- 
ter, low priority packets can be dropped preferentially 
over high priority packets at a network router whenever
sustained congestion occurs. Again, this preferential
dropping mechanism [11] is commonly available in 
modern routers at linespeeds. By using preferential drop 
at interior routers rather than simply blocking packets at 
the perimeter when a rate limit has been reached, our 
solution ensures that no packet gets dropped unless a 
network link capacity has indeed been exceeded. Under 
normal network conditions, in the absence of sustained 
congestion, packets will get forwarded in the same 
manner as without our surge protection scheme. 


IV. BANDWIDTH ALLOCATION POLICIES


Intuitively, PSP works by fully allocating the available 
network resources into the high priority class in some 
fair manner so that the high priority class rate limits 
for the different OD pairs are at least as high as the 
expected normal traffic. This way, should a DDoS attack 
occur that would saturate links along the attack route, 
normal traffic corresponding to crossfire OD pairs would 
be isolated from the attack traffic, thus minimizing 
collateral damage. In particular, packets for a particular 
crossfire OD pair would only be dropped at a congested 
network link if the actual normal traffic for that flow 
is above the bandwidth allocation threshold given to 
it. Therefore, bandwidth allocation plays a central role 
in affecting the drop probability of normal crossfire 
traffic during an attack. As such, the goal of bandwidth 
allocation is to allocate the available network resources 
with the objective of minimizing the drop probabilities 
for all OD pairs in some fair manner.


A. Formulation


To achieve the objectives of minimizing drop probability
and ensuring fair allocation of network resources, we
formulate the bandwidth allocation problem as a utility
max-min fair allocation problem [8], [9], [26], [23]. The
utility max-min fair allocation problem can be stated as
follows. Let $\vec{x} = (x_1, x_2, \ldots, x_N)$ be the allocation to
$N$ flows, and let $(\beta_1(x_1), \beta_2(x_2), \ldots, \beta_N(x_N))$ be $N$
utility functions, with each $\beta_i(x_i)$ corresponding to the
utility function for flow $i$. An allocation $\vec{x}$ is said to
be utility max-min fair if and only if increasing one
component $x_i$ must be at the expense of decreasing some
other component $x_j$ such that $\beta_j(x_j) \le \beta_i(x_i)$.

Conventionally, the literature on max-min
fair allocation uses the vector notation $\vec{x}(t) =
(x_1(t), x_2(t), \ldots, x_N(t))$ to represent the allocation
for some time period $t$. The correspondence to our
bandwidth allocation matrix $B(t) = [b_{s,d}(t)]$ is
straightforward: $b_{s_i,d_i}(t) = x_i(t)$ is the bandwidth
allocation at time $t$ for flow $i$, with the corresponding
OD pair of ingress and egress nodes $(s_i, d_i)$. Unless
otherwise clarified, we will use the conventional vector
notation $\vec{x}(t) = (x_1(t), x_2(t), \ldots, x_N(t))$ and our
bandwidth allocation matrix notation interchangeably.

The utility max-min fair allocation problem has been
well-studied, and as shown in [9], [26], the problem
can be solved by means of a “water-filling” algorithm.
We briefly outline here how the algorithm works. The
basic idea is to iteratively calculate the utility max-min
fair share for each flow in the network. Initially,
all flows are allocated rate $x_i = 0$ and are considered
free, meaning that their rates can be further increased.
At each iteration, the water-filling algorithm aims to
find the largest increase in bandwidth allocation to free
flows that will result in the maximum common utility
with the available link capacities. The provided utility
functions, $(\beta_1(x_1), \beta_2(x_2), \ldots, \beta_N(x_N))$, are used to
determine this maximum common utility. When a link is
saturated, it is removed from further consideration, and
the corresponding flows that cross these saturated links
are fixed from further increase in bandwidth allocation.
The algorithm converges after at most $L$ iterations, where
$L$ is the number of links in the network, since at least
one new link becomes saturated in each iteration. The
reader is referred to [9], [26] for detailed discussions.
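
The water-filling procedure outlined above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of [9], [26]: it assumes every utility function is continuous and strictly increasing so that an inverse exists, it locates each common utility level by bisection, and the names `routes`, `capacity`, `inverse_utility`, and `u_max`, as well as the three-flow example, are hypothetical.

```python
def utility_max_min(routes, capacity, inverse_utility, u_max, iters=100, eps=1e-6):
    """Water-filling sketch for utility max-min fair allocation.

    routes:          flow -> list of links the flow crosses
    capacity:        link -> capacity
    inverse_utility: flow -> function mapping a utility level to a rate
    u_max:           largest utility level worth searching (assumption)
    """
    alloc = {f: 0.0 for f in routes}
    free = set(routes)   # flows whose rate may still grow
    level = 0.0          # current common utility of the free flows

    def load(link, u):
        # Fixed flows keep their rate; free flows sit at utility level u.
        return sum(inverse_utility[f](u) if f in free else alloc[f]
                   for f, path in routes.items() if link in path)

    def feasible(u, links):
        return all(load(l, u) <= capacity[l] for l in links)

    while free:
        links = {l for f in free for l in routes[f]}
        if feasible(u_max, links):
            level = u_max
        else:
            lo, hi = level, u_max   # bisection for the largest feasible level
            for _ in range(iters):
                mid = (lo + hi) / 2
                if feasible(mid, links):
                    lo = mid
                else:
                    hi = mid
            level = lo
        for f in free:
            alloc[f] = inverse_utility[f](level)
        # Fix every free flow crossing a link that is now (numerically) saturated.
        saturated = {l for l in links if load(l, level) >= capacity[l] - eps}
        fixed = {f for f in free if any(l in saturated for l in routes[f])}
        if level >= u_max or not fixed:
            break
        free -= fixed
    return alloc


# Hypothetical example: linear utilities beta_i(x) = x / w_i, so x = w_i * u.
weights = {"f1": 1.0, "f2": 3.0, "f3": 1.0}
routes = {"f1": ["L1"], "f2": ["L1", "L2"], "f3": ["L2"]}
capacity = {"L1": 8.0, "L2": 10.0}
inverse = {f: (lambda u, w=w: w * u) for f, w in weights.items()}
result = utility_max_min(routes, capacity, inverse, u_max=100.0)
```

In this example link L1 saturates first at a common utility of 2, which fixes f1 and f2; f3 then grows alone until L2 saturates. Bisection is used only to keep the sketch independent of the form of the utility functions; for linear utilities the level can be computed in closed form.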

In the context of PSP, the utility max-min fair algorithm
is used to implement different bandwidth allocation
policies. In particular, we describe in this section
two bandwidth allocation policies, one called Mean-PSP,
and the other called CDF-PSP. Both are based
on traffic data collected from historical traffic measurements.
The first policy, Mean-PSP, simply uses the
average historical traffic demands observed as weights in
the corresponding utility functions. Mean-PSP is based
on the simple intuition that flows with higher average
traffic demands should receive proportionally higher
bandwidth allocation. This policy was first presented in
our preliminary work [10]. However, this policy does
not directly consider the traffic variance observed in the
traffic measurements.

TABLE I
TRAFFIC DEMANDS AND THE CORRESPONDING BANDWIDTH
ALLOCATIONS FOR MEAN-PSP AND CDF-PSP.

      | Historical traffic measurements  | BW allocation
Flows | Measured demands (sorted) | Mean | Mean-PSP 1st | Mean-PSP 2nd | CDF-PSP 1st | CDF-PSP 2nd
(A,D) | 1, 1, 2, 2, 4             | 2    | 2            | 2            | 2           | 2
(B,D) | 1, 1, 1, 3, 4             | 2    | 2            | 2            | 3           | 3
(C,D) | 4, 5, 5, 5, …             | 6    | 6            | 6            | 5           | 5
(A,C) | 4, 5, 5, 5, …             | 6    | 6            | 8            | 5           | 8
(B,C) | 5, 5, 6, 6, …             | 6    | 6            | 8            | 6           | 7

To directly account for traffic variance, we propose a 
second policy, CDF-PSP, that explicitly aims to minimize 
drop probabilities by using the Cumulative Distribu- 
tion Functions (CDFs) [8] derived from the empirical 
distribution of traffic demands observed in the traffic 
measurements. These CDFs can be used to capture the 
probability that the actual traffic will not exceed a par- 
ticular bandwidth allocation. When these CDFs are used 
as utility functions, maximizing the utility corresponds 
directly to the minimization of drop probabilities. Each 
of these two policies is further illustrated next. 


B. Mean-PSP: Mean-based Max-min Fairness


Our first allocation policy, Mean-PSP, simply uses the
mean traffic demand to define the utility function. In particular,
the utility function for flow $i$ is a simple linear function
$\beta_i(x) = x/\mu_i$, where $\mu_i$ is the mean traffic demand of flow
$i$, which simplifies to an easier weighted max-min fair
allocation problem.

To illustrate how Mean-PSP works, consider the small
example shown in Figure 3. It depicts a simple network
topology with 4 nodes that are interconnected by 10 Gb/s
links. Consider the corresponding traffic measurements
shown in Table I. For simplicity of illustration, each flow
is described by just 5 data points, and the corresponding
mean traffic demands are also indicated in Table I.
Consider the first iteration of the Mean-PSP water-filling
procedure shown in Figure 4(a). The maximum common
utility that can be achieved by all free flows is $\beta(x) = 1$,
which corresponds to allocating 2 Gb/s each to the
OD pairs (A, D) and (B, D) and 6 Gb/s each to the
OD pairs (C, D), (A, C), and (B, C). For example,
$\beta_{A,D}(x) = x/2 = 1$ corresponds to allocating $x = 2$ Gb/s
since $\mu$ for (A, D) is 2. Since all three flows, (A, D),
(B, D), and (C, D), share a common link CD, the sum
of their first iteration allocation, 2 + 2 + 6 = 10 Gb/s,
would already saturate link CD. This saturated link is
removed from consideration in subsequent iterations, and
the flows (A, D), (B, D), and (C, D) are fixed at the
allocation of 2 Gb/s, 2 Gb/s, and 6 Gb/s, respectively.

On the other hand, link AC is only shared by flows
(A, C) and (A, D), which have an aggregate allocation
of 2 + 6 = 8 Gb/s on link AC after the first iteration.
This leaves 10 - 8 = 2 Gb/s of residual capacity for
the next iteration. Similarly, link BC is only shared by
flows (B, C) and (B, D), which also have an aggregate
allocation of 2 + 6 = 8 Gb/s on link BC after the first
iteration, with 2 Gb/s of residual capacity. After the first
iteration, flows (A, C) and (B, C) remain free.

In the second iteration, as shown in Figure 4(b), the
maximum common utility is achieved by allocating the 
remaining 2 Gb/s on link AC to flow (A,C) and the 
remaining 2 Gb/s on link BC to flow (B,C), resulting 
in each flow having 8 Gb/s allocated to it in total. The 
final Mean-PSP bandwidth allocation is shown in Table I. 
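
The two Mean-PSP iterations above can be reproduced with a short weighted water-filling sketch. It is only an illustration: the routes are inferred from the statements above about which flows share links AC, BC, and CD, the function and variable names are hypothetical, and every flow is assumed to cross at least one link. For linear utilities $\beta_i(x) = x/\mu_i$, each link limits the increase in common utility to its residual capacity divided by the sum of the mean demands of the free flows that cross it.

```python
def mean_psp(routes, capacity, mean_demand, eps=1e-9):
    """Weighted max-min water-filling with utilities beta_i(x) = x / mu_i."""
    alloc = {f: 0.0 for f in routes}
    residual = dict(capacity)
    free = set(routes)
    history = []                      # allocation after each iteration
    while free:
        # Largest common utility increase each link still allows.
        steps = []
        for link, left in residual.items():
            weight = sum(mean_demand[f] for f in free if link in routes[f])
            if weight > 0:
                steps.append(left / weight)
        step = min(steps)
        for f in free:
            increase = step * mean_demand[f]
            alloc[f] += increase
            for link in routes[f]:
                residual[link] -= increase
        # Flows crossing a saturated link are fixed from now on.
        saturated = {l for l, left in residual.items() if left <= eps}
        free = {f for f in free if not any(l in saturated for l in routes[f])}
        history.append(dict(alloc))
    return alloc, history


# Example of Figure 3 and Table I (rates in Gb/s); routes inferred from the text.
routes = {
    ("A", "D"): ["AC", "CD"],
    ("B", "D"): ["BC", "CD"],
    ("C", "D"): ["CD"],
    ("A", "C"): ["AC"],
    ("B", "C"): ["BC"],
}
capacity = {"AC": 10.0, "BC": 10.0, "CD": 10.0}
mean_demand = {("A", "D"): 2.0, ("B", "D"): 2.0, ("C", "D"): 6.0,
               ("A", "C"): 6.0, ("B", "C"): 6.0}
final, per_iteration = mean_psp(routes, capacity, mean_demand)
```

Here `per_iteration[0]` corresponds to the "1st" Mean-PSP column of Table I and `final` to the "2nd" column.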


C. CDF-PSP: CDF-based Max-min Fairness


Our second allocation policy, CDF-PSP, aims to explicitly
capture the traffic variance observed in historical
traffic measurements by using a Cumulative Distribution
Function (CDF) model as the utility function. In particular,
the use of CDFs [8] captures the acceptance probability
of a particular bandwidth allocation as follows.
Let $X_i(t)$ be a random variable that represents the actual
normal traffic for flow $i$ at time $t$, and let $x_i(t)$ be the
bandwidth allocation. Then the CDF of $X_i(t)$ is denoted
as

$$\Pr[X_i(t) \le x_i(t)] = \Phi_{i,t}(x_i(t)),$$

and the drop probability is simply the complementary
function

$$\Pr[X_i(t) > x_i(t)] = 1 - \Phi_{i,t}(x_i(t)).$$

Therefore, when CDFs are used to maximize the acceptance
probabilities for all flows in a max-min fair manner,
it is equivalent to minimizing the drop probabilities
for all flows in a min-max fair manner.

In general, the expected traffic can be modeled using
different probability density functions with the corresponding
CDFs. One option is to
use the empirical distribution that directly corresponds
to the historical traffic measurements taken. In particular,
let $(r_{i,1}(t), r_{i,2}(t), \ldots, r_{i,M}(t))$ be $M$ measurements
taken for flow $i$ at a particular time of day $t$ over some
historical data set. Then the empirical CDF is simply
defined as

$$\Phi_{i,t}(x_i(t)) = \frac{\#\,\text{measurements} \le x_i(t)}{M} = \frac{1}{M} \sum_{k=1}^{M} I(r_{i,k}(t) \le x_i(t)),$$

where $I(r_{i,k}(t) \le x_i(t))$ is the indicator that the measurement
$r_{i,k}(t)$ is less than or equal to $x_i(t)$. For the
example shown in Table I, the corresponding empirical
CDFs are shown in Figure 6. For example, in Figure 6(a)
for OD pair (A, D), a bandwidth allocation of 2 Gb/s
would correspond to an acceptance probability of 80%
(with the corresponding drop probability of 20%).
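
The empirical CDF defined above is a simple counting operation, which the following sketch makes concrete for the (A, D) row of Table I. The function and variable names are hypothetical, and the code implements only the step-function definition given in the equation.

```python
def empirical_cdf(measurements, x):
    """Fraction of historical measurements that are <= the allocation x."""
    return sum(1 for r in measurements if r <= x) / len(measurements)


demands_ad = [1, 1, 2, 2, 4]                # (A, D) measurements from Table I, in Gb/s
acceptance = empirical_cdf(demands_ad, 2)   # acceptance probability at 2 Gb/s
drop = 1 - acceptance                       # complementary drop probability
```

For (A, D), four of the five measurements are at most 2 Gb/s, which gives the 80% acceptance probability quoted above. Values quoted for allocations that fall between measured points may rely on interpolation in the plotted CDFs, which this step-function sketch does not attempt.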

To illustrate how CDF-PSP works, consider again the
example shown in Figure 3 and Table I. Consider the first
iteration of the CDF-PSP water-filling procedure shown
in Figure 5(a). To simplify notation, we will simply use,
for example, $\beta_{A,D}(x) = \Phi_{A,D}(x)$ to indicate the utility
function for flow (A, D) for some time period $t$, and we
will use analogous notations for the other flows.

In the first iteration, the maximum common utility 
that can be achieved by all free flows is an acceptance 
probability of $\beta(x) = 80\%$, which corresponds to
allocating 2 Gb/s to (A, D), 3 Gb/s to (B,D), 5 Gb/s 
each to (C, D) and (A,C), and 6 Gb/s to (B,C). This 
first iteration allocation is shown in bold black lines in 
Figure 6. With this allocation in the first iteration, link 
CD is again saturated since the sum of the first iteration 
allocation to flows (A, D), (B,D), and (C,D) is 2 + 
3 + 5 = 10 Gb/s, which would already reach the link 
capacity of CD. Therefore, the saturated link CD is 
removed from consideration in subsequent iterations, and 
the flows (A, D), (B,D), and (C,D) are fixed at the 
allocation of 2 Gb/s, 3 Gb/s, and 5 Gb/s, respectively. 

For link AC, which is shared by flows (A, C) and
(A, D), the first iteration allocation is 2 + 5 = 7 Gb/s,
leaving 10 - 7 = 3 Gb/s of residual capacity. Similarly, for
link BC, which is shared by flows (B, C) and (B, D),
the first iteration allocation is 3 + 6 = 9 Gb/s, leaving
10 - 9 = 1 Gb/s of residual capacity.


In the second iteration, as shown in Figure 5(b),
the maximum common utility of 90% is achieved for the
remaining free flows (A, C) and (B, C) by allocating
the remaining 3 Gb/s on link AC to flow (A, C) and the
remaining 1 Gb/s on link BC to flow (B, C), resulting in
a total of 8 Gb/s allocated to (A, C) and 7 Gb/s allocated
to (B, C). This second iteration allocation is shown in
dotted lines in Figure 6. The final CDF-PSP bandwidth
allocation is shown in Table I.


Comparing the results for CDF-PSP and Mean-PSP 
shown in Figure 6 and Table I, we see that CDF-PSP was 
able to achieve a higher worst-case acceptance probabil- 
ity for all flows than Mean-PSP. In particular, the CDF- 
PSP results shown in Figure 6 and Table I show that 
CDF-PSP was able to achieve a minimum acceptance 
probability of 80% for all flows whereas Mean-PSP 
was only able to achieve a lower worst-case acceptance 
probability of 70%. For example, for flow (B,D), the 
bandwidth allocation of 3 Gb/s determined by CDF- 
PSP corresponds to an 80% acceptance rate whereas 
the 2 Gb/s determined by Mean-PSP only corresponds 
to a 70% acceptance rate. The better worst-case result 
is because CDF-PSP specifically targets the max-min 
optimization of the acceptance probability by using the 
cumulative distribution function as the objective.


V. EXPERIMENTAL SETUP


We employed ns-2 based simulations to evaluate our 
PSP methods on two large real networks. 


US: This is the backbone of a large service provider in 
the US, and consists of around 700 routers and thousands 
of links ranging from T1 to OC768 speeds. 


EU: This is the backbone of a large service provider
in Europe. Its network structure is similar to that of the US
backbone, but it is larger, with about 150 more routers
and 500 more links.


While the results for the individual networks cannot be 
directly compared to each other because of differences 
in their network characteristics and traffic behavior, 
multiple network environments allow us to explore and 
understand the performance of our PSP methods for a 
range of diverse scenarios. 


A. Normal Traffic Demand 


For each network, using the methods outlined in [14],
we build ingress router to egress router traffic matrices
from several weeks' worth of sampled NetFlow
data that record the traffic for that network:
US (07/01/07–09/03/07) and EU (11/18/06–12/18/06
& 07/01/07–09/03/07). Specifically, the NetFlow data
contains sampled NetFlow records covering the entire
network. The sampling is performed on the routers with
a 1:500 packet sampling rate. The volume of sampled
records is then reduced using a smart
sampling technique [12]. The total size of smart sampled
data records was 3,600 GB and 1,500 GB for US and
EU, respectively. Finally, we annotate each record with
its customer egress interface (if it was not collected on
the egress router) based on route information.

For each time interval $\tau$, the corresponding OD flows
are represented by an $N \times N$ traffic matrix, where $N$ is
the number of access routers providing ingress or egress
to the backbone, and each entry contains the average
demand between the corresponding routers within that
interval. The above traffic data are used both for creating
the normal traffic demand for the simulator and
for computing the corresponding bandwidth allocation
matrices for the candidate PSP techniques. One desirable
characteristic from a network management, operations,
and system overhead perspective is to avoid too many
unnecessary fine time scale changes. Therefore, one goal
of our study was to evaluate the effectiveness of using
a single representative bandwidth allocation matrix for
an extended period of time. An implicit hypothesis is
that the bandwidth allocation matrix does not need to
be computed and updated on a fine timescale. To this
end, in the simulations, we use a finer timescale traffic
matrix with $\tau = 1$ min for determining the normal
traffic demand, and a coarser timescale 1 hour interval
for computing the bandwidth allocation matrix from
historical data sets.


B. DDoS Attack Traffic 


To test the robustness of our PSP approach, we used 
two different types of attack scenarios for evaluation — 
a distributed attack scenario for the US backbone and 
a targeted attack scenario for the EU backbone. As we 
shall see in Section VI, PSP is very effective in both 
types of attacks. In particular, we used the following 
attack data. 


US DDoS: For the US backbone, the attack matrix that
we used for evaluation is based on large DDoS alarms
that were actually generated by a commercial DDoS detection
system deployed at key locations in the network.
In particular, among the actual large DDoS alarms that
were generated during the period of 6/1/05 to 7/1/06,
we selected the largest one, involving the largest number
of attack flows, as the attack matrix. This was a highly
distributed attack involving 40% (nearly half) of the
ingress routers as attack sources and 25% of the egress
routers as attack destinations. The number of attack flows
observed at a single ingress router was up to 150,
with an average of about 24 attack flows sourced at each
ingress router. The attacks were distributed over a large
number of egress routers. Although the actual attacks
were large enough to trigger the DDoS alarms, they did
not actually cause overloading on any backbone link.
Therefore, we scaled up each attack flow to an average
of 1% of the ingress router link access capacity. Since
there were many flows, this was already sufficient to
cause overloading on the network.


EU DDoS: For the Europe backbone, we had no com- 
mercial DDoS detection logs available. Therefore, we 
created our own synthetic DDoS attack data. To eval- 
uate PSP under different attack scenarios, we created 
a targeted attack scenario in which all attack flows 
are targeted to only a small number of egress routers. 
In particular, to mimic the US DDoS attack data, we 
randomly selected 40% of ingress routers to be attack 
sources. However, to create a targeted attack scenario, 
we purposely selected at random only 2% of the egress 
routers as attack destinations. With only 2% of the egress 
routers involved as attack destinations, we concentrated 
the attacks from each ingress router to just 1-3 destina- 
tions with demand set at 10% of the ingress router link 
access capacity. 


C. ns-2 Simulation Details 


Our experiments are implemented using ns-2 simula- 
tions. This involved implementing the 2-class bandwidth 
allocation, and simulating both the normal and DDoS 
traffic flows.


Bandwidth Allocation and Enforcement: The metering
and class differentiation of packets are implemented 
at the perimeter of each network using the differen- 
tiated service module in ns-2, which allows users to 
set rate limits for each individual OD pair. Our simu- 
lation updates the rate limits hourly by pre-computing 
the bandwidth allocation matrix based on the histori- 
cal traffic matrices that were collected several weeks 
prior to the attack date: US (07/01/07—09/02/07) and 
EU (11/18/06—12/17/06 & 07/01/07—09/02/07). 

The differentiated service module marks incoming 
packets into different priorities based on the configured 
rate limits set by our bandwidth allocation matrix and 
the estimated incoming traffic rate of the OD pair. 
Specifically, we implemented differentiated service using 
TSW2CM (Time Sliding Window with 2 Color Mark- 
ing), an ns-2 provided policer. As its name implies, the 
TSW2CM policer uses a sliding time window to estimate 
the traffic rate. 

If the estimated traffic exceeds the given threshold, 
the incoming packet is marked into the low priority 
class; otherwise, it is marked into the high priority class. 
We then use existing preferential dropping mechanisms 
to ensure that lower priority packets are preferentially 
dropped over higher priority packets when memory 
buffers get full. In particular, WRED/RIO¹ is one such
preferential dropping mechanism that is widely deployed 
in existing commercial routers [11], [5]. We used this 
WRED/RIO mechanism in our ns-2 simulations. 
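
The metering and two-color marking described above can be sketched as follows. The rate update is the time sliding window estimator as commonly described (for example in RFC 2859); the window length, the zero initial rate, the deterministic marking rule taken from the text, and all names are assumptions made for illustration, and the ns-2 policer may differ in such details.

```python
class TswTwoColorMarker:
    """Sketch of a time-sliding-window rate estimator with two-color marking."""

    def __init__(self, rate_limit, window=1.0):
        self.rate_limit = rate_limit   # allowed rate for the OD pair, bytes per second
        self.window = window           # averaging window in seconds (assumed value)
        self.avg_rate = 0.0            # estimated rate, starts at zero (assumption)
        self.last_arrival = 0.0

    def mark(self, now, packet_bytes):
        # Decay the previous estimate over the window and add the new packet.
        bytes_in_window = self.avg_rate * self.window
        elapsed = now - self.last_arrival
        self.avg_rate = (bytes_in_window + packet_bytes) / (elapsed + self.window)
        self.last_arrival = now
        # Marking rule as stated in the text: above the limit means low priority.
        return "low" if self.avg_rate > self.rate_limit else "high"


def run(marker, interval, count, packet_bytes=100):
    """Feed equally spaced packets to the marker and collect the tags."""
    return [marker.mark((k + 1) * interval, packet_bytes) for k in range(count)]


# Hypothetical streams against a 1000 B/s limit: 500 B/s and 2000 B/s.
slow = run(TswTwoColorMarker(rate_limit=1000.0), interval=0.2, count=100)
fast = run(TswTwoColorMarker(rate_limit=1000.0), interval=0.05, count=100)
```

The slow stream stays below the limit and is tagged high priority throughout, while the fast stream starts as high priority and switches to low priority once the estimated rate climbs past the limit.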


Traffic Simulation: For simulation data (testing phase),
we purposely used a different data set than the traffic
matrices used for bandwidth allocation (learning phase).
In particular, for each network, we selected a weekday
outside of the days used for bandwidth allocation, and
we considered 48 1-minute time intervals (one every 30
minutes) across the entire 24 hours of this selected day.
The exact date that we selected to simulate normal traffic
is 09/03/07 for both the US and EU networks. Recall
that for a given time interval $\tau$, we compute normal
and DDoS traffic matrices that give average traffic rates
across that interval. These matrices are used to generate
the traffic flows for that time interval. Both DDoS and
network traffic are simulated as constant bandwidth UDP
streams with fixed packet sizes of 1 kB.


VI. EXPERIMENTAL RESULTS 


We begin our evaluations in Section VI-A by quantifying
the potential extent and severity of the problem
that we are trying to address: the amount of collateral
damage in each network in the absence of any protection
mechanism. We then develop an understanding of the
damage mitigation capabilities and properties of our PSP
mechanism, first at the network level in Section VI-B
and then at the individual OD-pair level in Section VI-C.
Section VI-D explores the effectiveness of the proposed
schemes under scaled attacks, and Section VI-E
summarizes all the results.

¹RIO is WRED with two priority classes.

We shall use the term No-PSP to refer to the baseline 
scenario with no surge protection. We use the terms 
Mean-PSP and CDF-PSP to refer to the PSP schemes 
that use proportional and empirical CDF-based water- 
filling bandwidth allocation algorithms respectively. Re- 
call that an OD pair is considered as (i) an attacked OD 
pair if there is attack traffic along that pair, (ii) a crossfire 
OD pair if it shares at least one link with an OD pair 
containing attack traffic, and (iii) a non-crossfire OD 
pair if it is neither an attacked nor a crossfire OD pair. 


A. Potential for Collateral Damage 


We first explore the extent to which OD pairs and their 
offered traffic demands are placed in potential harm’s 
way because they share network path segments with a 
given set of attack flows. In Figure 7, we report the 
relative proportion of OD pairs in the categories of 
attacked, crossfire, and non-crossfire OD pairs for both 
the US and EU backbones. 

As described in Section V-B, 40% of the ingress
routers and 25% of the egress routers were involved in
the DDoS attack on the US backbone. In general, for
a network with $N$ ingress/egress routers, there are $N^2$
possible OD pairs (the ratio of routers to OD pairs is
1-to-$N$). For the US backbone, with about 700 routers,
there are nearly half a million OD pairs. Although 40%
of the ingress routers and 25% of the egress routers were
involved in the attack, the number of attack destinations
from each ingress router was on average about 24 egress
routers, resulting in just 1.2% of the OD pairs under
direct attack. In general, because the number of OD pairs
grows quadratically with $N$ (i.e., $N^2$), even in a highly
distributed attack scenario where the attack flows come
from all $N$ routers, the number of OD pairs under direct
attack may still only correspond to a small percentage
of OD pairs. For the EU backbone, there are about 850
routers and about three quarters of a million OD pairs.
For the targeted attack scenario described in Section V-B,
40% of the ingress routers were also involved in the
DDoS attack, but the attacks were concentrated on just
2% of the egress routers. Again, even though 40% of
the ingress routers were involved, only 0.1% of the OD
pairs, among $N^2$ OD pairs, were under direct attack.
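
The quadratic growth argument can be checked with a back-of-envelope calculation. The sketch below is only a rough estimate: it uses the rounded router counts from the text, assumes every participating ingress router attacks the stated average number of destinations, and takes 2 destinations per ingress router for the EU scenario as an assumed midpoint of the 1-3 range given in Section V-B. The function name is hypothetical.

```python
def od_pair_counts(num_routers, ingress_share, dests_per_ingress):
    """Return (total OD pairs, estimated fraction of OD pairs under direct attack)."""
    total_pairs = num_routers ** 2
    attacked_pairs = num_routers * ingress_share * dests_per_ingress
    return total_pairs, attacked_pairs / total_pairs


us_total, us_share = od_pair_counts(700, 0.40, 24)   # about 700 routers, 24 flows each
eu_total, eu_share = od_pair_counts(850, 0.40, 2)    # about 850 routers, assumed 2 each
```

With these rounded inputs the estimates come out near 1.4% for the US and 0.09% for the EU, the same order as the 1.2% and 0.1% reported above, which presumably come from exact counts.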

In general, the percentage of OD pairs that are in the
crossfire of attack flows depends on where the attacks
occurred and how traffic is routed over a particular
network. For the US backbone, we observe that the
percentage of crossfire OD pairs is very large (95.5%),
causing substantial collateral damage even though the
attacks were directed over only 1.2% of the OD pairs. This
is somewhat expected given the distributed nature of
the attack, where a high percentage of both ingress and
egress routers were involved in the attack. For the EU
backbone, the observed percentage of crossfire OD pairs
is also very large (83.5%). This is somewhat surprising
because the attacks were targeted to only a small number
of egress routers. This large footprint can be attributed
to the fact that even a relatively small number of attack
flows can go over common links that were shared by a
vast majority of other OD pairs.

TABLE II
COLLATERAL DAMAGE IN THE ABSENCE OF PSP WITH THE 10th
AND 90th PERCENTILE INDICATED IN THE BRACKETS.

Network | Impacted OD pairs (%) | Impacted demand (%)  | Mean packet loss rate of impacted OD pairs (%)
US      | 41.37 [39.64, 42.72]  | 37.79 [35.16, 39.37] | 49.15 [47.62, 50.43]
EU      | 43.18 [38.48, 47.81]  | 45.33 [38.90, 52.05] | 68.11 [65.51, 70.46]

We next depict the relative proportions of the overall 
normal traffic demand corresponding to each type of OD 
pairs. While the classification of the OD pairs into the 3 
categories is fixed for a given network and attack matrix, 
the relative traffic demand for the different classes is 
time-varying, depending on the actual normal traffic 
demand in a given time interval. Figure 8 presents a 
breakdown of the total normal traffic demands for the 3 
classes across the 48 time intervals that we explored. 
Note that for both the networks, crossfire OD pairs 
account for a significant proportion of the total traffic 
demand. Figures 7 and 8 together suggest that an attack
directed even over a relatively small number of ingress-egress
interface combinations could be routed around
the network in a manner that can impact a significant
proportion of OD pairs and overall network traffic.

The results above provide us an indication of the 
potential “worst-case” impact footprint that an attack can 
unleash, if its strength is sufficiently scaled up. This is 
because a crossfire OD pair will suffer collateral packet 
losses only if some link(s) on its path get congested. 
While the above results do not provide any measure of 
actual damage impact, they do nevertheless point to the 
existence of a real potential for widespread collateral 
damage, and underline the importance and urgency of 
developing techniques to mitigate and minimize the 
extent of such damage. 

We next consider the actual collateral damage induced
by the specified attacks in the absence of any protection
scheme. We define a crossfire OD pair to be impacted
in a given time interval if it suffered some packet loss
in that interval. Table II presents (i) the total number of,
and (ii) the traffic demand for, the impacted OD pairs as a
percentage of the corresponding values for all crossfire
OD pairs, and (iii) the mean packet loss rate across
the impacted OD pairs. To account for time variability,
we present the average value (with the 10th and 90th
percentile indicated in the brackets) for the three metrics
across the 48 attacked time intervals. Overall, the table
shows that not only can the attacks impact a significant
proportion of the crossfire OD pairs and network traffic,
but that they can cause severe packet drops in many of
them. For example, in the US network, in 90% of the
time intervals, (i) at least 39.64% of the crossfire OD
pairs were impacted, and (ii) the average packet loss
rate across the impacted OD pairs was 47.62% or more.
To put these numbers in proper context, note that TCP,
which accounts for the vast majority of traffic today, is
known to have severe performance problems once the
loss rate exceeds a few single-digit percentage points.


B. Network-wide PSP Performance Evaluation 


We start the evaluation of PSP by focusing on 
network-wide aggregate performance for crossfire OD 
pairs and note the consistently and substantially lower loss
rates under either Mean-PSP or CDF-PSP across the 
entire day. 

1) Total Packet Loss Rate: 

For each attack time interval, we compute the total 
packet loss rate which is the total number of packets 
lost as a percentage of the total offered load from all 
crossfire OD pairs. Table III summarizes the mean, 10” 
and 90” percentile of the total packet loss rates across 
48 attack time intervals. The mean loss rates under No- 
PSP in US and EU networks are 17.93% and 30.48%, 
respectively. The loss rate is relatively stable across time 
as indicated by the tight interval between the 10” and 
90‘” percentile numbers. In contrast, the mean loss rate 
is much smaller, less than 3%, for either PSP scheme. 
Figure 9 shows the loss rate across time, for the 2 PSP 
schemes, expressed as a percentage of the corresponding 
loss rates under No-PSP. Note that even though the attack 
remains the same over all 48 attack time intervals, the 
normal traffic demand matrix is time-varying, and hence 
the observed variability in the time series. In particular, 
we observe comparatively smaller improvements during 
the the network traffic peak times, such as 12PM (GMT) 
in the EU backbone and 6PM (GMT) in the US back- 
bone. This behavior is because the amount of traffic 
that could be admitted as high priority is bounded by 
the network’s carrying capacity. During high demand 
time intervals, on one hand, links will be more loaded 
increasing the likelihood of congestion and overload. 
On the other hand, more packets will get classified 
as low priority, increasing the population size that can 
be dropped under congestion and overload. Table IV
summarizes the performance improvements for the PSP
schemes in terms of the loss rate reduction relative to
No-PSP or Mean-PSP across the different time intervals. For
each network, on average, either PSP scheme reduces the
loss rate in a time interval by more than 90% from the
corresponding No-PSP value. In addition, CDF-PSP has
consistently better performance than Mean-PSP, with loss
rates that are on average 34.75% and 19.90% lower for
the US and EU networks, respectively.

Fig. 7. The percentage of OD pairs belonging to each of the three OD pair types classified under attack traffic.

TABLE III
THE TIME-AVERAGED CROSSFIRE OD-PAIR TOTAL PACKET LOSS RATE WITH THE 10TH AND 90TH PERCENTILE INDICATED IN THE BRACKETS.

        No-PSP            Mean-PSP        CDF-PSP
US      17.93             1.63            1.11
        [16.40, 18.79]    [1.02, 2.14]    [0.47, 1.71]
EU      30.48             2.73            2.32
        [27.22, 32.86]    [1.21, 4.54]    [0.79, 4.22]

Fig. 8. The proportion of normal traffic demand corresponding to the three types of OD pairs. (a) US (slice values 4.1%, 13.0%, 82.9%). (b) Europe (slice values 0.8%, 27.7%, 71.5%).

Fig. 9. The crossfire OD pair total packet loss rate ratio over No-PSP across 24 hours (48 attack time intervals, 30 minutes apart).

TABLE IV
THE TIME-AVERAGED TOTAL PACKET LOSS REDUCTION RELATIVE TO No-PSP OR MEAN-PSP WITH THE 10TH AND 90TH PERCENTILE INDICATED IN THE BRACKETS.

        Reduction ratio    Reduction ratio    Reduction ratio
        from No-PSP        from No-PSP        from Mean-PSP
        to Mean-PSP        to CDF-PSP         to CDF-PSP
US      91.00              93.90              34.75
        [88.56, 93.89]     [90.77, 97.21]     [20.06, 53.09]
EU      91.17              92.51              19.90
        [85.79, 96.17]     [86.46, 97.58]     [4.01, 41.58]
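
As a minimal sketch of how reduction ratios such as those in Table IV could be derived, the Python snippet below computes the percentage reduction in each time interval and then averages over time. The function name and the three sample intervals are hypothetical and are not taken from the measurements; averaging per-interval ratios (rather than taking the ratio of the averages) is an assumption about how the time-averaged values were obtained.

```python
from statistics import mean


def reduction_ratios(baseline, scheme):
    """Per-interval percentage reduction of `scheme` relative to `baseline`.

    Intervals with a zero baseline are skipped to avoid dividing by zero.
    """
    return [100.0 * (b - s) / b for b, s in zip(baseline, scheme) if b > 0]


# Hypothetical total packet loss rates (%) for three attack time intervals.
no_psp = [18.0, 20.0, 16.0]
cdf_psp = [0.9, 2.0, 0.8]

per_interval = reduction_ratios(no_psp, cdf_psp)
time_averaged = mean(per_interval)
print([round(r, 2) for r in per_interval], round(time_averaged, 2))
```

With 48 intervals instead of three, the same list of per-interval ratios would also supply the 10th and 90th percentiles reported in the brackets.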


2) Mean OD Packet Loss Rate: 


Our second metric is the mean OD packet loss rate 
which measures the average packet loss rate across all 
crossfire OD pairs with non-zero traffic demand. For 
each of the 48 attack time intervals, for each crossfire 
OD pair that had traffic demand in that interval, we 
compute its packet loss rate, i.e., the number of packets
dropped as a percentage of its total offered load. The 
mean OD packet loss rate is obtained by averaging 
across these per-OD pair loss rates for that interval. 
Table V presents the average, 10th and 90th percentile
values for that metric across the 48 time intervals for the
different PSP scenarios. Figure 10 shows the time series
of the metric for Mean-PSP and CDF-PSP, expressed as
a percentage of the corresponding value for No-PSP. The
table and the figure clearly show that, across time, No-PSP
had a consistently much higher mean OD packet loss
rate than Mean-PSP and CDF-PSP, while CDF-PSP had
the best performance. The percentage improvements are
summarized in Table VI, which shows that going from
No-PSP to CDF-PSP results in a reduction in the mean 
OD packet loss rate by 87.50% and 89.93% for the US 
and EU networks, respectively. Moving from Mean-PSP 
to CDF-PSP reduces this loss rate metric by 33.20% and 
25.46% respectively in the two networks. 
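
The difference between this metric and the total packet loss rate can be illustrated with a short Python sketch. The dictionaries, OD pair names, and packet counts are hypothetical, and the definition of the total loss rate used here (all dropped packets over all offered packets) is an assumption made for the comparison.

```python
def mean_od_loss_rate(offered, dropped):
    """Average of per-OD-pair loss rates (%) over pairs with non-zero offered load."""
    rates = [
        100.0 * dropped.get(od, 0) / load
        for od, load in offered.items()
        if load > 0
    ]
    return sum(rates) / len(rates) if rates else 0.0


def total_loss_rate(offered, dropped):
    """All dropped packets as a percentage of all offered packets (assumed definition)."""
    total = sum(offered.values())
    return 100.0 * sum(dropped.get(od, 0) for od in offered) / total if total else 0.0


# Hypothetical interval: the idle pair ("C", "A") is excluded from the mean.
offered = {("A", "B"): 1000, ("A", "C"): 100, ("C", "A"): 0}
dropped = {("A", "B"): 10, ("A", "C"): 20}

print(mean_od_loss_rate(offered, dropped), total_loss_rate(offered, dropped))
```

In this toy interval the small OD pair loses 20% of its packets, which dominates the per-pair mean but barely moves the total loss rate, so the two metrics capture different aspects of collateral damage.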


3) Number of impacted crossfire OD pairs: We next
determine the number of impacted OD pairs, i.e., the
crossfire OD pairs that suffer some packet loss at each
time interval. It is desirable to minimize this number,
since many important network applications, including
real-time gaming and VoIP, are very sensitive to packet
loss and experience substantial performance degradation
even under relatively low loss rates. For each of the
48 attack time intervals, we determine the number of 
impacted crossfire OD pairs as a percentage of the
total number of crossfire OD pairs with non-zero traffic
demand in that time interval. We summarize the mean
and the 10th and 90th percentiles from the distribution
of the resulting values across the 48 time intervals in
Table VII for No-PSP and the two PSP schemes. In the
US network, the mean proportion of impacted OD pairs
drops from a high of 41.37% under No-PSP to 12.85% for
Mean-PSP and to 7.16% for CDF-PSP. We present the time
series of the proportion of impacted OD pairs for the two PSP
schemes (normalized by the corresponding value for No-PSP)
across the 48 time intervals in Figure 11, and
summarize the savings from the two PSP schemes in Table VIII.
Across all the time intervals, we note that a high
percentage of crossfire OD pairs had packet losses under
No-PSP, and that both PSP schemes dramatically reduce
this proportion, with CDF-PSP consistently having the
lowest proportion of impacted OD pairs. According to
Table VIII, the proportion of impacted OD pairs
in the US network is reduced, on average, by over 69%
going from No-PSP to Mean-PSP. From Mean-PSP to
CDF-PSP, the proportion drops, on average, by a further
substantial 45.47%.

Fig. 10. The mean OD packet loss rate ratio over No-PSP across 24 hours (48 attack time intervals, 30 minutes apart).

Fig. 11. The ratio of number of crossfire OD-pairs with packet loss over No-PSP across 24 hours (48 attack time intervals, 30 minutes apart).

TABLE V
THE TIME-AVERAGED CROSSFIRE OD-PAIR MEAN PACKET LOSS RATE. THE 10TH AND 90TH PERCENTILE ARE INDICATED IN THE BRACKETS.

        No-PSP            Mean-PSP        CDF-PSP
US      20.33             3.75            2.56
        [19.25, 21.07]    [2.69, 4.31]    [1.33, 3.39]
EU      29.34             4.04            3.23
        [26.62, 32.16]    [2.02, 6.71]    [1.09, 5.98]

TABLE VI
THE TIME-AVERAGED CROSSFIRE OD-PAIR MEAN PACKET LOSS

        Reduction ratio    Reduction ratio    Reduction ratio
        from No-PSP        from No-PSP        from Mean-PSP
        to Mean-PSP        to CDF-PSP         to CDF-PSP
US      81.65              87.50              33.20
        [79.27, 86.19]     [83.88, 93.33]     [19.65, 52.84]
EU      86.63              89.39              25.46
        [79.01, 92.77]     [81.15, 95.92]     [9.83, 44.94]

TABLE VII

        No-PSP            Mean-PSP         CDF-PSP
US      41.37             12.85            7.16
        [39.06, 42.73]    [9.58, 14.58]    [3.94, 9.24]
EU      43.18             12.81            8.79
        [38.43, 47.94]    [7.28, 19.70]    [3.84, 15.46]

TABLE VIII

        Reduction ratio    Reduction ratio    Reduction ratio
        from No-PSP        from No-PSP        from Mean-PSP
        to Mean-PSP        to CDF-PSP         to CDF-PSP
US      69.05              82.82              45.47
        [65.20, 75.64]     [78.11, 90.22]     [35.12, 59.30]
EU      71.18              80.42              34.94
        [58.62, 81.49]     [67.66, 90.36]     [21.72, 47.60]
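
The impacted-pair metric and its normalization against No-PSP can be sketched as follows. This is an illustrative Python example only: the OD pair labels and packet counts are made up, and the helper name is hypothetical.

```python
def impacted_share(offered, dropped):
    """Percentage of OD pairs with non-zero demand that lost at least one packet."""
    active = [od for od, load in offered.items() if load > 0]
    if not active:
        return 0.0
    impacted = sum(1 for od in active if dropped.get(od, 0) > 0)
    return 100.0 * impacted / len(active)


# Hypothetical interval with four active crossfire OD pairs and one idle pair.
offered = {"A-B": 500, "A-C": 200, "B-C": 300, "C-A": 50, "C-B": 0}
dropped_no_psp = {"A-B": 40, "A-C": 10}
dropped_cdf_psp = {"A-B": 5}

share_no_psp = impacted_share(offered, dropped_no_psp)
share_cdf_psp = impacted_share(offered, dropped_cdf_psp)

# Value normalized by No-PSP, and the corresponding reduction ratio.
normalized = 100.0 * share_cdf_psp / share_no_psp
reduction = 100.0 - normalized
print(share_no_psp, share_cdf_psp, normalized, reduction)
```

Note that the metric counts a pair as impacted regardless of how many packets it lost, which is why it complements the two loss rate metrics.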


C. OD pair-level Performance 


In Section VI-B, we explored the performance of the 
PSP techniques from the overall network perspective. 
We focus the analysis below on the performance of 
individual crossfire OD pairs across time. 

1) Loss Frequency: For each crossfire OD pair, we 
define its loss frequency to be the percentage of
the 48 attack time intervals in which it incurred some 
packet loss. Note that this metric only captures how 
often across the different times of day, a crossfire OD 
pair experiences loss events, and is not meant to capture 
the actual magnitude of individual loss events which 
we shall study later.

Fig. 12. CDF of the loss frequency for all crossfire OD pairs.

Figure 12 plots the cumulative
distribution function (CDF) of the loss frequencies across
all the crossfire OD pairs which had some traffic in any 
of the 48 intervals. In the figure, a given point (x, y) 
indicates that y percent of crossfire OD-pairs had packet 
loss in at most x percent of the attack time intervals. 
Therefore, corresponding to the same x value, the larger 
the y value for a PSP scheme, the better because that 
indicates that the scheme had a higher percentage of 
OD pairs with loss frequency less than or equal to x. The
figure shows that across the range of loss frequencies,
CDF-PSP always has the highest percentage of OD
pairs compared to No-PSP and Mean-PSP at any given
x value. In particular, both CDF-PSP and Mean-PSP
substantially increase the number of OD pairs without
packet loss in any of the 48 attack time intervals, with
CDF-PSP performing the best. The percentage of OD pairs
with 0% loss frequency increases from 55.86% for No-PSP
to 62.83% for Mean-PSP and 72.97% for CDF-PSP
for the US network. The corresponding values for the EU
network are 50.44%, 63.22% and 70.91%, respectively.
In addition, for the US network, 98% of the OD pairs
have loss frequencies bounded by 22.92% under Mean-PSP
and 18.75% under CDF-PSP. For the same 98%
coverage of the OD pair population under No-PSP,
the bounding loss frequency is a much higher 66.67%. 
Thus, using either Mean-PSP or CDF-PSP substantially 
reduces the loss frequency for a large proportion of the 
crossfire OD pairs. 
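
The loss frequency and the way a point (x, y) on its CDF is read can be sketched in Python as below. The example uses four time intervals per OD pair instead of 48, and all loss values are hypothetical.

```python
def loss_frequency(loss_by_interval):
    """Percentage of time intervals in which an OD pair saw some packet loss."""
    lossy = sum(1 for loss in loss_by_interval if loss > 0)
    return 100.0 * lossy / len(loss_by_interval)


def cdf_at(values, x):
    """Empirical CDF: percentage of values that are less than or equal to x."""
    return 100.0 * sum(1 for v in values if v <= x) / len(values)


# Hypothetical per-interval loss rates (%) for four crossfire OD pairs.
losses = {
    "A-B": [0.0, 0.0, 0.0, 0.0],
    "A-C": [0.0, 1.5, 0.0, 0.0],
    "B-C": [2.0, 3.0, 0.0, 4.0],
    "C-A": [0.0, 0.0, 0.0, 0.0],
}
frequencies = [loss_frequency(series) for series in losses.values()]

# y value at x = 0: share of OD pairs that never experienced loss.
print(cdf_at(frequencies, 0.0), cdf_at(frequencies, 25.0))
```

The value of the CDF at x = 0 corresponds to the "0% loss frequency" percentages quoted above, and a curve that lies higher at every x indicates a better scheme.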

2) Packet Loss Rate per OD pair: 

After exploring how often packet losses occur, we 
next analyze the magnitude of packet losses for different 
crossfire OD pairs. An OD-pair can have different loss 
rates at different attack time intervals, and here for each 
crossfire OD pair, we consider the 90th percentile of
these loss rates across time, where we consider only
time intervals where that OD pair had non-zero traffic
demand. Figure 13 shows the cumulative distribution
function (CDF) of this 90th percentile packet loss rate
across all crossfire OD-pairs, except those that had no 
traffic demand during the entire 48 attack time intervals. 

Fig. 13. CDF of the 90th percentile packet loss rate for all crossfire OD pairs.

In the figure, a given point (x, y) indicates that for y%
of crossfire OD-pairs, in 90% of the time intervals in
which that OD pair had some traffic demand, the packet
loss was at most x%. The most interesting region from
a practical performance perspective lies to the left of the
graph for low values of the loss rate. This is because
many network applications and even reliable transport
protocols like TCP have very poor performance and
are practically unusable beyond a loss rate of a few
percentage points. Focusing on the 0 to 10% loss rate range,
which is widely considered to include this “habitable
zone of loss rates”, the figure shows that Mean-PSP
and CDF-PSP both have a substantially higher percentage
of OD pairs in this zone, compared to No-PSP, and
that CDF-PSP has significantly better performance. For
example, in the US network, the percentage of OD pairs
with less than 10% loss rate increases from just 59%
for No-PSP to 70.48% for Mean-PSP and 79.62% for
CDF-PSP. The trends are similar for the EU network.
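
A small Python sketch of the per-OD-pair statistic follows. The text does not say which percentile estimator was used, so the nearest-rank method below is an assumption, as are the function names and the sample data (ten intervals instead of 48).

```python
import math


def percentile_nearest_rank(values, p):
    """p-th percentile (integer p in 1..100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def p90_loss(loss_rates, demand):
    """90th percentile loss rate over intervals in which the OD pair had demand."""
    active = [loss for loss, d in zip(loss_rates, demand) if d > 0]
    return percentile_nearest_rank(active, 90) if active else None


# Hypothetical loss rates (%) of one OD pair; it is idle in the last interval.
loss_rates = [0, 0, 0, 0, 0, 1, 2, 3, 8, 50, 0]
demand = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 0]

p90 = p90_loss(loss_rates, demand)
in_habitable_zone = p90 is not None and p90 < 10
print(p90, in_habitable_zone)
```

Using the 90th percentile rather than the maximum means that a single extreme interval (the 50% value here) does not by itself push an OD pair out of the low-loss region.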


It should be noted that towards the tail of the distribu- 
tion, for very large values of the loss rate, the percentage 
of OD pairs that have less than a certain loss rate x is 
not always greater for CDF-PSP than for Mean-PSP. We 
defer the explanation for this to Section VI-C.4 where we 
analyze the packet losses of an OD pair under different
PSP schemes in greater detail. 


3) Correlating Loss Rate with OD pair characteris- 
tics: 

The loss rate experienced by an OD pair for a PSP 
scheme is a function of various factors including the 
historical traffic demand for that OD pair which influ- 
ences the admission decisions to the high priority class. 
To understand the relationship, we consider 2 simple fea- 
tures of its historical traffic profile. The historical traffic 
demand of an OD pair is the traffic demand for that 
OD pair averaged across all the historical time intervals. 
The historical activity factor is the percentage of time 
intervals that the OD pair had some traffic demand out 
of all historical time intervals. We explore the relation 
between each of these features and the 90th percentile
packet loss rate defined in the previous subsection in the
scatter plots in Figures 14² and 15³, where each dot
corresponds to a crossfire OD pair and the location of
the dot is determined by its 90th percentile packet loss
rate and either its historical demand (Figure 14) or its
historical activity factor (Figure 15).

Fig. 14. The correlation scatter plot for all crossfire OD-pairs between its 90th percentile OD packet loss rate under No-PSP/CDF-PSP and its historical traffic demand.

Fig. 15. The correlation scatter plot for all crossfire OD-pairs between its 90th percentile OD packet loss rate under No-PSP/CDF-PSP and its historical activity factor.

Comparing the results for No-PSP and CDF-PSP in
the two figures, we note that unlike No-PSP, under CDF-PSP,
the top right region in the plots is empty and
no OD pair with high historical demand or high historical
activity has a high loss rate. Since the historical demand
and activity factor values for an OD pair do not change
from No-PSP to CDF-PSP, the scatter plots indicate that
for many high demand or high activity factor OD pairs,
the loss rates are dramatically reduced going from No-PSP
to CDF-PSP, shifting their corresponding points to
the left side. Under CDF-PSP, all the points with high
loss rates correspond to OD pairs with low historical
demand or activity factors.

This suggests that CDF-PSP provides better protection 
for OD pairs with high demand or high activity. This 
is very desirable from a service provider perspective 
because OD pairs with high demand or high activity 
typically carry traffic from large customers who pay the 
most and are the most sensitive to service interruptions. 
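
The two historical features used in this subsection can be computed as in the following Python sketch. The function name and the four-interval demand history are hypothetical; the units (kb/s) follow the axis mentioned in the footnote on Figure 14.

```python
def historical_profile(demand_history):
    """Return (historical traffic demand, historical activity factor in %).

    The demand is averaged over all historical intervals, including idle ones;
    the activity factor is the share of intervals with some traffic.
    """
    n = len(demand_history)
    average_demand = sum(demand_history) / n
    activity_factor = 100.0 * sum(1 for d in demand_history if d > 0) / n
    return average_demand, activity_factor


# Hypothetical demand (kb/s) of one OD pair over four historical intervals.
demand, activity = historical_profile([0, 400, 0, 1200])
print(demand, activity)
```

Each crossfire OD pair then contributes one dot per scatter plot, positioned by its 90th percentile loss rate and one of these two values.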

4) OD pair Loss Improvement: 

As mentioned in Section VI-C.2, CDF-PSP does not 
always result in a lower packet loss for every OD pair 
than Mean-PSP. This can be attributed to the different 
amounts of packets being marked in the high priority 
class for an OD pair under different policies. It is also 
possible that both PSP techniques may exhibit higher 
loss rates for some OD pair in some time interval, 
compared to No-PSP. This is because under either PSP 
scheme, under high load conditions, most of the network 
capacity is used to serve high priority packets, and any 
residual capacity is used to serve low priority packets. 


²The y-axis is cut off at 40,000 kb/s because only a few OD pairs
exceeded that demand and all of them had less than 10% loss rate.

³Due to space constraints, we only show the results for the US
network, while the results are similar in the EU network.

Therefore, packets that are marked as low priority will
tend to have higher drop rates than under No-PSP,
where all packets were treated equally. As a result, for
an OD pair, if a large proportion of its offered load 
gets marked as low priority, and there is congestion 
on the path, in theory it could suffer more losses than 
under No-PSP. However, this should not be a common 
case, since the PSP bandwidth allocation is designed to 
accommodate the normal traffic demand of an OD pair 
in the high priority class, based on historical demands. 
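
The effect described above can be reproduced with a toy single-link model in Python. This is only a sketch under simplifying assumptions that are not stated in the text: one congested link, strict priority between the two classes, and drops shared proportionally among flows within a class. The flow names and volumes are hypothetical.

```python
def loss_no_psp(capacity, demands):
    """Single class: every flow sees the same loss fraction on an overloaded link."""
    total = sum(demands.values())
    fraction = max(0.0, 1.0 - capacity / total) if total else 0.0
    return {flow: fraction for flow in demands}


def loss_psp(capacity, high, low):
    """Two classes: high priority is served first, low priority gets the residual."""
    high_total = sum(high.values())
    low_total = sum(low.values())
    high_fraction = max(0.0, 1.0 - capacity / high_total) if high_total else 0.0
    residual = max(0.0, capacity - high_total)
    low_fraction = max(0.0, 1.0 - residual / low_total) if low_total else 0.0
    loss = {}
    for flow in set(high) | set(low):
        h, l = high.get(flow, 0.0), low.get(flow, 0.0)
        loss[flow] = (h * high_fraction + l * low_fraction) / (h + l) if h + l else 0.0
    return loss


# Flow "x" is atypical: most of its offered load exceeds its high priority share.
capacity = 100.0
high = {"x": 10.0, "others": 60.0}
low = {"x": 30.0, "attack": 100.0}
all_traffic = {"x": 40.0, "others": 60.0, "attack": 100.0}

without_psp = loss_no_psp(capacity, all_traffic)
with_psp = loss_psp(capacity, high, low)
print(without_psp["x"], with_psp["x"], with_psp["others"])
```

In this example the flow whose traffic is mostly tagged low priority loses more than it would without protection, while the flow that stays within its high priority allocation loses nothing.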
In the following, we examine how often CDF-PSP has 
better performance than either No-PSP or Mean-PSP. 
For both No-PSP and Mean-PSP, we determine for 
each OD pair the percentage of the 48 attack time 
intervals when the packet loss rate was no less than the 
loss rate under CDF-PSP. We plot the complementary 
cumulative distribution function (CCDF) of this value 
across all crossfire OD pairs with demand at any of the 
48 attack time intervals, for No-PSP and Mean-PSP in 
Figure 16. For each curve, a given point (x,y) in the 
figure indicates that for y percent of the crossfire OD 
pairs, the loss rates are greater than or equal to that under 
CDF-PSP in at least x percent of the time intervals. The 
graphs indicate that CDF-PSP outperforms both No-PSP 
and Mean-PSP for most OD pairs in a large proportion 
of the time intervals. Compared to No-PSP, for the EU 
network, under CDF-PSP, 90.72% of the OD pairs have 
equal or lower loss rates in all 48 time intervals, and 98% 
of the OD pairs have lower loss rates in at least 93.75% 
of the time intervals. For the same network, compared 
to Mean-PSP, CDF-PSP resulted in equal or lower loss 
rates at all 48 time intervals for 81.27% of the OD pairs. 
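
How a point on the CCDF of Figure 16 is obtained can be sketched in Python as follows. The example uses four intervals rather than 48, and the loss values and helper names are hypothetical.

```python
def pct_intervals_not_better(baseline, cdf_psp):
    """Percentage of intervals where the baseline loss is >= the CDF-PSP loss."""
    hits = sum(1 for b, c in zip(baseline, cdf_psp) if b >= c)
    return 100.0 * hits / len(baseline)


def ccdf_at(values, x):
    """Empirical CCDF: percentage of values that are greater than or equal to x."""
    return 100.0 * sum(1 for v in values if v >= x) / len(values)


# Hypothetical per-interval loss rates (%) for two crossfire OD pairs.
no_psp = {"A-B": [5, 6, 7, 8], "A-C": [0, 2, 0, 0]}
cdf_psp = {"A-B": [1, 1, 1, 1], "A-C": [0, 0, 1, 0]}

per_pair = [pct_intervals_not_better(no_psp[od], cdf_psp[od]) for od in no_psp]

# y value at x = 100: share of OD pairs for which CDF-PSP is never worse.
print(per_pair, ccdf_at(per_pair, 100.0), ccdf_at(per_pair, 75.0))
```

The value of the CCDF at x = 100 corresponds to the percentages of OD pairs quoted above that have equal or lower loss rates under CDF-PSP in all time intervals.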


D. Performance under scaled attacks 


Given the growing penetration of broadband connec- 
tions and the ever-increasing availability of large armies 
of botnets “for hire”, it is important to understand the 
effectiveness of the PSP techniques with respect to in- 
creasing attack intensity. To study this, for each network, 
we vary the intensity of the attack matrix by scaling the
demand of every attack flow by a factor ranging from 0
to 3, in steps of size 0.25. For each value of the scaling
factor, we measure the time-averaged mean OD packet
loss rate of crossfire OD pairs (defined in Section VI-B.2)
across eight 1-min. time intervals, equally spaced
across 24 hours. Figure 17 shows that the loss rate
under No-PSP increases much faster than under Mean-PSP
and CDF-PSP, as the attack intensity increases. This
is because under No-PSP, all the normal traffic packets
have to compete for limited bandwidth resources with
the attack traffic, while with our protection scheme only
normal traffic marked in the low priority class is affected
by the increasing attack. Therefore, even in the extreme
case when the attack traffic demand is sufficient to clog
all links, our protection scheme can still guarantee that
the normal traffic marked in the high priority class goes
through the network. Consequently, our PSP schemes
are much less sensitive to the degree of congestion, as
evidenced by the much slower growth of the drop rate. For
example, in the US network, as the scale factor increases
from 1 to 3, under No-PSP, the mean drop rate jumps
from slightly above 20% to almost 40%. In comparison,
under CDF-PSP, the mean loss rate increases very little,
from less than 3% to 4%, over the same range of attack
intensities. The trends demonstrate that across the range
of scaling factor values, both the PSP schemes are very
effective in mitigating collateral damage by keeping loss
rates low, with CDF-PSP having an edge over Mean-PSP.

Fig. 16. CCDF of percentage of time that the loss rate for a crossfire OD pair under No-PSP and Mean-PSP exceeds that under CDF-PSP.

Fig. 17. The time-averaged mean crossfire OD-pair packet loss rate as the attack volume scaling factor increases from 0 to 3.


E. Summary of Results


In this section, we summarize the main findings
from the evaluation of our PSP methods on two large
backbone networks. First, we show that the potential for
collateral damage is significant in that even when a small
number of OD pairs are attacked, a majority of the OD
pairs in a network can be substantially impacted. For
both the US and EU backbones, we observed that the
percentage of OD pairs impacted is surprisingly large,
95.5% and 83.5%, even though the attacks were directed
over only 1.2% and 0.1% of the OD pairs, respectively.
Compared to no protection, Mean-PSP and CDF-PSP
significantly reduced the total packet loss by up to 97.58%,
the mean OD pair packet loss rates by up to 95.92%, and
the number of crossfire OD pairs with packet loss by up to
90.36%. Further, CDF-PSP substantially improved over
Mean-PSP by reducing the loss rate across all evaluation
matrices. Specifically, CDF-PSP reduced the total packet
loss of Mean-PSP by up to 53.09% in the US network
and up to 41.58% in the EU network, and CDF-PSP
reduced the number of OD pairs with packet loss by up
to 59.30% in the US network and up to 47.60% in the
EU network. Finally, we show PSP can maintain low
packet loss rates even when the intensity of attacks is
increased significantly.


VII. CONCLUSION 


PSP provides network operators with a broad first line 
of proactive defense against DDoS attacks, significantly 
reducing the impact of sudden bandwidth-based attacks 
on a service provider network. The proactive surge 
protection is achieved by providing bandwidth isolation 
between traffic flows. This isolation is achieved through 
a combination of traffic data collection, bandwidth al- 
location of network resources, metering and tagging 
of packets at the network perimeter, and preferential 
dropping of packets inside the network. Among its 
salient features, PSP is readily deployable using existing 
router mechanisms, and PSP does not rely on any 
unauthenticated packet header information. The latter 
feature makes the solution resilient to evading attack 
schemes that launch many seemingly legitimate TCP 
connections with spoofed IP addresses and port numbers. 
This is due to the fact that PSP focuses on protecting 
traffic between different ingress-egress interface pairs 
in a provider network, and both the ingress and egress 
interface of an IP datagram can be directly determined 
by the network operator. By taking into consideration 
traffic variability observed in traffic measurements, our 
proactive protection solution can ensure the maximiza- 
tion of the acceptance probability of each flow in a 
max-min fair manner, or equivalently the minimization 
of the drop probability in a min-max fair manner. Our
extensive evaluation results across two large commercial
backbone networks, using both distributed and targeted
attack scenarios, show that up to 95.5% of the network
could suffer collateral damage without protection, but
our solution was able to significantly reduce the amount
of collateral damage by up to 97.58% in terms of the
number of packets dropped and 90.36% in terms of the
number of flows with packet loss. In addition, we show
that PSP can maintain low packet loss rates even when
the intensity of attacks is increased significantly.


Abstract


Botnets are now the key platform for many Internet 
attacks, such as spam, distributed denial-of-service 
(DDoS), identity theft, and phishing. Most of the current 
botnet detection approaches work only on specific 
botnet command and control (C&C) protocols (e.g., 
IRC) and structures (e.g., centralized), and can become 
ineffective as botnets change their C&C techniques. In 
this paper, we present a general detection framework that 
is independent of botnet C&C protocol and structure, 
and requires no a priori knowledge of botnets (such as 
captured bot binaries and hence the botnet signatures, 
and C&C server names/addresses). We start from the 
definition and essential properties of botnets. We define 
a botnet as a coordinated group of malware instances 
that are controlled via C&C communication channels. 
The essential properties of a botnet are that the bots 
communicate with some C&C servers/peers, perform 
malicious activities, and do so in a similar or correlated 
way. Accordingly, our detection framework clusters 
similar communication traffic and similar malicious 
traffic, and performs cross cluster correlation to identify 
the hosts that share both similar communication patterns 
and similar malicious activity patterns. These hosts 
are thus bots in the monitored network. We have 
implemented our BotMiner prototype system and 
evaluated it using many real network traces. The results 
show that it can detect real-world botnets (IRC-based, 
HTTP-based, and P2P botnets including Nugache and 
Storm worm), and has a very low false positive rate. 


1 Introduction 


Botnets are becoming one of the most serious threats to 
Internet security. A botnet is a network of compromised 
machines under the influence of malware (bot) code. The 
botnet is commandeered by a “botmaster” and utilized as
a “resource” or “platform” for attacks such as distributed
denial-of-service (DDoS) attacks, and fraudulent activities
such as spam, phishing, identity theft, and information
exfiltration.


In order for a botmaster to command a botnet, there 
needs to be a command and control (C&C) channel 
through which bots receive commands and coordinate 
attacks and fraudulent activities. The C&C channel is 
the means by which individual bots form a botnet. Cen- 
tralized C&C structures using the Internet Relay Chat 
(IRC) protocol have been utilized by botmasters for a 
long time. In this architecture, each bot logs into an IRC 
channel, and seeks commands from the botmaster. Even 
today, many botnets are still designed this way. Quite a 
few botnets, though, have begun to use other protocols 
such as HTTP [8, 14, 24, 39], probably because HTTP- 
based C&C communications are more stealthy given 
that Web traffic is generally allowed in most networks. 
Although centralized C&C structures are effective, they 
suffer from the single-point-of-failure problem. For ex- 
ample, if the IRC channel (or the Web server) is taken 
down due to detection and response efforts, the botnet 
loses its C&C structure and becomes a collection of 
isolated compromised machines. Recently, botmasters 
began using peer-to-peer (P2P) communication to avoid 
this weakness. For example, Nugache [28] and Storm 
worm [18,23] (a.k.a. Peacomm) are two representative 
P2P botnets. Storm, in particular, distinguishes itself 
as having infected a large number of computers on the 
Internet and effectively becoming one of the “world’s top 
super-computers” [27] for the botmasters. 

Researchers have proposed a few approaches [7, 17, 
19, 20, 26, 29, 35, 40] to detect the existence of botnets 
in monitored networks. Almost all of these approaches 
are designed for detecting botnets that use IRC- or HTTP-based
C&C [7, 17, 26, 29, 40]. For example, Rishi [17]
is designed to detect IRC botnets using known IRC bot
nickname patterns as signatures. In [26, 40], network
flows are clustered and classified according to IRC-like
traffic patterns. Another more recent system, BotSniffer [20],
is designed mainly for detecting C&C activities
with centralized servers (with protocols such as IRC
and HTTP¹). One exception is perhaps BotHunter [19],
which is capable of detecting bots regardless of the C&C 
structure and network protocol as long as the bot be- 
havior follows a pre-defined infection life cycle dialog 
model. 

However, botnets are evolving and can be quite flexi- 
ble. We have witnessed that the protocols used for C&C 
evolved from IRC to others (e.g., HTTP [8, 14, 24, 39]), 
and the structure moved from centralized to distributed 
(e.g., using P2P [18, 28]). Furthermore, a botnet dur- 
ing its lifetime can also change its C&C server address 
frequently, e.g., using fast-flux service networks [22]. 
Thus, the aforementioned detection approaches designed 
for IRC or HTTP based botnets may become ineffective 
against the recent/new botnets. Even BotHunter may fail 
as soon as botnets change their infection model(s). 

Therefore, we need to develop a next generation botnet 
detection system, which should be independent of the 
C&C protocol, structure, and infection model of botnets, 
and be resilient to the change of C&C server addresses. 
In addition, it should require no a priori knowledge of 
specific botnets (such as captured bot binaries and hence 
the botnet signatures, and C&C server names/addresses). 


In order to design such a general detection system that 
can resist evolution and changes in botnet C&C tech- 
niques, we need to study the intrinsic botnet communi- 
cation and activity characteristics that remain detectable 
with the proper detection features and algorithms. We 
thus start with the definition and essential properties of a 
botnet. We define a botnet as: 

“A coordinated group of malware instances that are 
controlled via C&C channels”. 

The term “malware” means these bots are used to 
perform malicious activities. According to [44], about 
53% of botnet activity commands observed in thousands 
of real-world IRC-based botnets are related to scan (for
the purpose of spreading or DDoS²), and about 14.4%
are related to binary downloading (for the purpose of
malware updating). In addition, most HTTP-based
and P2P-based botnets are used to send spam [18, 39].
The term “controlled” means these bots have to con- 
tact their C&C servers to obtain commands to carry out 
activities, e.g., to scan. In other words, there should 
be communication between bots and C&C servers/peers 
(which can be centralized or distributed). Finally, the 
term “coordinated group” means that multiple (at least 
two) bots within the same botnet will perform similar or
correlated C&C communications and malicious activities.
If the botmaster commands each bot individually
with a different command/channel, the bots are nothing
but some isolated/unrelated infections. That is, they do
not function as a botnet according to our definition and
are out of the scope of this work³.

¹BotSniffer could be extended to support other protocol based
C&C, if the corresponding protocol matchers are added.

²For spreading, the scans usually span many different hosts (within
a subnet) indicated by the botnet command. For DDoS, usually there
are numerous connection attempts to a specific host. In both cases, the
traffic can be considered as scanning related.

We propose a general detection framework that is 
based on these essential properties of botnets. This 
framework monitors both who is talking to whom that 
may suggest C&C communication activities and who is 
doing what that may suggest malicious activities, and 
finds a coordinated group pattern in both kinds of activi- 
ties. More specifically, our detection framework clusters 
similar communication activities in the C-plane (C&C 
communication traffic), clusters similar malicious activ- 
ities in the A-plane (activity traffic), and performs cross 
cluster correlation to identify the hosts that share both 
similar communication patterns and similar malicious 
activity patterns. These hosts, according to the botnet 
definition and properties discussed above, are bots in the 
monitored network. 
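
The idea of cross cluster correlation can be illustrated with a deliberately simplified Python sketch. It is not the correlation algorithm of the framework, which is not described in this section; it simply flags hosts that share a C-plane cluster and an A-plane cluster with at least one other host, matching the "at least two bots" reading of a coordinated group. The function name, the `min_group` parameter, and the host addresses are hypothetical.

```python
def cross_plane_suspects(c_clusters, a_clusters, min_group=2):
    """Hosts that co-occur in some C-plane cluster and some A-plane cluster.

    A group is reported only if at least `min_group` hosts share both clusters.
    """
    suspects = set()
    for c_cluster in c_clusters:
        for a_cluster in a_clusters:
            shared = set(c_cluster) & set(a_cluster)
            if len(shared) >= min_group:
                suspects |= shared
    return suspects


# Hypothetical clustering output for a small monitored network.
c_clusters = [
    {"10.0.0.1", "10.0.0.2", "10.0.0.3"},  # similar communication patterns
    {"10.0.0.4", "10.0.0.5"},
]
a_clusters = [
    {"10.0.0.1", "10.0.0.2", "10.0.0.9"},  # similar scanning activity
    {"10.0.0.4"},                          # a single host acting alone
]

print(sorted(cross_plane_suspects(c_clusters, a_clusters)))
```

Host 10.0.0.4 appears in both planes but never together with another host, so under this simplified rule it is treated as an isolated infection rather than as part of a botnet.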

This paper makes the following main contributions. 


• We develop a novel general botnet detection framework
that is grounded on the definition and essential
properties of botnets. Our detection framework
is thus independent of botnet C&C protocol and
structure, and requires no a priori knowledge (e.g.,
C&C addresses/signatures) of specific botnets. It
can detect both centralized (e.g., IRC, HTTP) and
current (and possibly future) P2P based botnets.


• We define a new “aggregated communication flow”
(C-flow) record data structure to store aggregated
traffic statistics, and design a new layered clustering
scheme with a set of traffic features measured on
the C-flow records. Our clustering scheme can
accurately and efficiently group similar C&C traffic
patterns.


• We build a BotMiner prototype system based on our
general detection framework, and evaluate it with
multiple real-world network traces including normal
traffic and several real-world botnet traces that
contain IRC, HTTP and P2P-based botnet traffic
(including Nugache and Storm). The results show
that BotMiner has a high detection rate and a low
false positive rate.
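
To make the notion of an aggregated flow record more concrete, the following Python sketch groups individual flow records under a common key and keeps running totals. The fields of a C-flow record are not listed in this section, so the grouping key (protocol, source, destination, destination port), the stored counters, and the two derived features are all assumptions chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CFlow:
    """Aggregated statistics for all flows that share one grouping key."""
    flows: int = 0
    packets: int = 0
    total_bytes: int = 0

    def packets_per_flow(self):
        return self.packets / self.flows

    def bytes_per_packet(self):
        return self.total_bytes / self.packets


def aggregate(flow_records):
    """Fold individual flow records into one CFlow per (proto, src, dst, dport)."""
    cflows = defaultdict(CFlow)
    for rec in flow_records:
        key = (rec["proto"], rec["src"], rec["dst"], rec["dport"])
        entry = cflows[key]
        entry.flows += 1
        entry.packets += rec["packets"]
        entry.total_bytes += rec["bytes"]
    return dict(cflows)


# Hypothetical flow records observed during one monitoring period.
records = [
    {"proto": "tcp", "src": "10.0.0.1", "dst": "198.51.100.7", "dport": 6667, "packets": 4, "bytes": 400},
    {"proto": "tcp", "src": "10.0.0.1", "dst": "198.51.100.7", "dport": 6667, "packets": 6, "bytes": 800},
    {"proto": "udp", "src": "10.0.0.2", "dst": "203.0.113.9", "dport": 53, "packets": 1, "bytes": 80},
]
cflows = aggregate(records)
irc_like = cflows[("tcp", "10.0.0.1", "198.51.100.7", 6667)]
print(irc_like.flows, irc_like.packets_per_flow(), irc_like.bytes_per_packet())
```

Per-record features of this kind are what a clustering step would then compare across hosts; which features are actually used is specified later in the paper, not here.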


The rest of the paper is organized as follows. In
Section 2, we describe the assumptions, objectives, and
architecture of our BotMiner detection framework, and its
detection algorithms and implementation. In Section 3,
we describe our evaluation on various real-world network
traces. In Section 4, we discuss current limitations
and possible solutions. We review the related work in
Section 5 and conclude in Section 6.

³One can still use our complementary system, BotHunter [19], to
detect individual bots. In this paper, we focus on the detection of a
botnet. We further clarify our assumptions in Section 2.1 and address
limitations in Section 4.


2 BotMiner Detection Framework and Implementation


2.1 Problem Statement and Assumptions 


According to the definition given above, a botnet is char- 
acterized by both a C&C communication channel (from 
which the botmaster’s commands are received) and ma- 
licious activities (when commands are executed). Some 
other forms of malware (e.g., worms) may perform mali- 
cious activities, but they do not connect to a C&C chan- 
nel. On the other hand, some normal applications (e.g., 
IRC clients and normal P2P file sharing software) may 
show communication patterns similar to a botnet’s C&C 
channel, but they do not perform malicious activities. 

Figure 1 illustrates two typical botnet structures, 
namely centralized and P2P. The bots receive commands 
from the botmaster using a push or pull mechanism [20] 
and execute the assigned tasks. 

The operation of a centralized botnet is relatively easy 
and intuitive [20], whereas this is not necessarily true 
for P2P botnets. Therefore, here we briefly illustrate an 
example of a typical P2P-based botnet, namely Storm 
worm [18,23]. In order to issue commands to the bots, 
the botmaster publishes/shares command files over the 
P2P network, along with specific search keys that can 
be used by the bots to find the published command 
files. Storm bots utilize a pull mechanism to receive 
the commands. Specifically, each bot frequently contacts 
its neighbor peers searching for specific keys in order to 
locate the related command files. In addition to search 
operations, the bots also frequently communicate with 
their peers and send keep-alive messages. 

In both centralized and P2P structures, bots within 
the same botnet are likely to behave similarly in terms 
of communication patterns. This is largely due to the 
fact that bots are non-human driven, pre-programmed to 
perform the same routine C&C logic/communication as 
coordinated by the same botmaster. In the centralized 
structure, even if the address of the C&C server may 
change frequently (e.g., by frequently changing the A 
record of a Dynamic DNS domain name), the C&C 
communication patterns remain unchanged. In the case 
of P2P-based botnets, the peer communications (e.g., to 
search for commands or to send keep-alive messages) 
follow a similar pattern for all the bots in the botnet, 
although each bot may have a different set of neighbor 
peers and may communicate on different ports. 

Regardless of the specific structure of the botnet (centralized
or P2P), members of the same botnet (i.e., the
bots) are coordinated through the C&C channel. In gen- 
eral, a botnet is different from a set of isolated individual 
malware instances, in which each different instance is 
used for a totally different purpose. Although in an 
extreme case a botnet can be configured to degenerate 
into a group of isolated hosts, this is not the common 
case. In this paper, we focus on the most typical and 
useful situation in which bots in the same botnet perform 
similar/coordinated activities. To the best of our knowl- 
edge, this holds true for most of the existing botnets 
observed in the wild. 

To summarize, we assume that bots within the same 
botnet will be characterized by similar malicious activ- 
ities, as well as similar C&C communication patterns. 
Our assumption holds even in the case when the bot- 
master chooses to divide a botnet into sub-botnets, for 
example by assigning different tasks to different sets of 
bots. In this case, each sub-botnet will be characterized 
by similar malicious activities and C&C communication
patterns, and our goal is to detect each sub-botnet. In 
Section 4 we provide a detailed discussion on possible 
evasive botnets that may violate our assumptions. 


2.2 Objectives 


The objective of BotMiner is to detect groups of compro- 
mised machines within a monitored network that are part 
of a botnet. We do so by passively analyzing network 
traffic in the monitored network. 

Note that we do not aim to detect botnets at the very 
moment when victim machines are compromised and 
infected with malware (bot) code. In many cases these 
events may not be observable by passively monitoring 
network traffic. For example, an already infected lap- 
top may be carried in and connected to the monitored 
network, or a user may click on a malicious email at- 
tachment and get infected. In this paper we are not 
concerned with the way internal hosts become infected 
(e.g., by malicious email attachments, remote exploitation,
or Web drive-by downloads). We focus on the detection
of groups of already compromised machines inside the 
monitored network that are part of a botnet. 

Our detection approach meets several goals: 


• it is independent of the protocol and structure used
for communicating with the botmaster (the C&C
channel) or peers, and is resistant to changes in the
location of the C&C server(s).


• it is independent of the content of the C&C communication.
That is, we do not inspect the content
of the C&C communication itself, because C&C
could be encrypted or use a customized (obscure)
protocol.


• it generates a low number of false positives and false
negatives.


• the analysis of network traffic employs a reasonable
amount of resources and time, making detection
relatively efficient.


Figure 1: Possible structures of a botnet: (a) centralized; (b) peer-to-peer.

Figure 2: Architecture overview of our BotMiner detection framework.


2.3 Architecture of BotMiner Detection Framework


Figure 2 shows the architecture of our BotMiner detec- 
tion system, which consists of five main components: 
C-plane monitor, A-plane monitor, C-plane clustering 
module, A-plane clustering module, and cross-plane cor- 
relator. 

The two traffic monitors in C-plane and A-plane can
be deployed at the edge of the network examining traffic
between internal and external networks, similar to
BotHunter [19] and BotSniffer [20]⁴. They run in parallel
and monitor the network traffic. The C-plane monitor
is responsible for logging network flows in a format
suitable for efficient storage and further analysis, and
the A-plane monitor is responsible for detecting suspicious
activities (e.g., scanning, spamming, and exploit
attempts). The C-plane clustering and A-plane clustering
components process the logs generated by the C-plane
and A-plane monitors, respectively. Both modules extract
a number of features from the raw logs and apply
clustering algorithms in order to find groups of machines
that show very similar communication (in the C-plane)
and activity (in the A-plane) patterns. Finally, the
cross-plane correlator combines the results of the C-plane and
A-plane clustering and makes a final decision on which
machines are possibly members of a botnet. In an ideal
situation, the traffic monitors should be distributed on the
Internet, and the monitor logs are reported to a central
repository for clustering and cross-plane analysis.


4 All these tools can also be deployed in LANs.

In our current prototype system, traffic monitors are
implemented in C for the purpose of efficiency (working
on real-time network traffic). The clustering and correlation
analysis components are implemented mainly in
Java and R (http://www.r-project.org/), and
they work offline on logs generated from the monitors.

The following sections present the details of the design
and implementation of each component of the detection
framework.


2.4 Traffic Monitors 


C-plane Monitor. The C-plane monitor captures network
flows and records information on who is talking to
whom. Many network routers support the logging of network
flows, e.g., Cisco (www.cisco.com) and Juniper
(www.juniper.net) routers. Open source solutions
like Argus (Audit Record Generation and Utilization
System, http://www.qosient.com/argus) are
also available. We adapted an efficient network flow capture
tool developed at our research lab, fcapture⁵,
which is based on the Judy library
(http://judy.sourceforge.net/). Currently, we limit our interest
to TCP and UDP flows. Each flow record contains the
following information: time, duration, source IP, source
port, destination IP, destination port, and the number of
packets and bytes transferred in both directions. The main
advantage of our tool is that it works very efficiently
on high-speed networks (very low packet loss ratio on
a network with 300Mbps traffic), and can generate very
compact flow records that comply with the requirement
for further processing by the C-plane clustering module.
As a comparison, our flow capturing tool generates
compressed records ranging from 200MB to 1GB per
day from the traffic in our academic network, whereas
Argus generates around 36GB of compressed binary flow
records per day on average (without recording any payload
information). Our tool makes the storage of several
weeks or even months of flow data feasible.
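
To make the flow record concrete, the following is a minimal Python sketch of one record with the fields listed above. The class name `FlowRecord`, the field names, and the explicit `protocol` field are assumptions made for illustration; the actual record layout of the capture tool is not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowRecord:
    """One TCP/UDP flow as a C-plane monitor might log it (names are hypothetical)."""
    start_time: float   # seconds since the Unix epoch
    duration: float     # seconds
    protocol: str       # "TCP" or "UDP" (assumed field, needed later for aggregation)
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    pkts_out: int       # packets sent from source to destination
    pkts_in: int        # packets sent from destination to source
    bytes_out: int
    bytes_in: int

    def total_packets(self) -> int:
        return self.pkts_out + self.pkts_in

    def total_bytes(self) -> int:
        return self.bytes_out + self.bytes_in


# Example record with made-up values (documentation address ranges).
example = FlowRecord(1199145600.0, 2.5, "TCP", "10.0.0.5", 40312,
                     "192.0.2.10", 6667, 12, 10, 900, 1500)
```

Keeping per-direction counters makes it possible to recognize one-way flows later without storing any payload.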


A-plane Monitor. The A-plane monitor logs informa- 
tion on who is doing what. It analyzes the outbound 
traffic through the monitored network and is capable 
of detecting several malicious activities that the internal 
hosts may perform. For example, the A-plane monitor 
is able to detect scanning activities (which may be used 
for malware propagation or DoS attacks), spamming, 
binary downloading (possibly used for malware update), 
and exploit attempts (used for malware propagation or 
targeted attacks). These are the most common and “use- 
ful” activities a botmaster may command his bots to 
perform [9, 33, 44]. 

Our A-plane monitor is built on Snort [36], an
open-source intrusion detection tool, for the purpose of
convenience. We adapted existing intrusion detection
techniques and implemented them as Snort pre-processor
plug-ins or signatures. For scan detection we adapted
SCADE (Statistical sCan Anomaly Detection Engine),
which is a part of BotHunter [19] and available at [11].
Specifically, we mainly use two anomaly detection modules:
the abnormally high scan rate and the weighted failed
connection rate. We use an OR combination rule, so
that an event detected by either of the two modules
will trigger an alert. In order to detect spam-related
activities, we developed a new Snort plug-in. We focused
on detecting anomalous numbers of DNS queries for
MX records from the same source IP and the number
of SMTP connections initiated by the same source to
mail servers outside the monitored network. Normal
clients are unlikely to act as SMTP servers and therefore
should rely on the internal SMTP server for sending
emails. Repeated use of many distinct external SMTP servers
by the same internal host is an indication
of possible malicious activities. For the detection of
PE (Portable Executable) binary downloading we used
an approach similar to PEHunter [42] and BotHunter’s
egg download detection method [19]. One can also use
specific exploit rules in BotHunter to detect internal hosts
that attempt to exploit external machines. Other
state-of-the-art detection techniques can be easily added to our
A-plane monitoring to expand its ability to detect typical
botnet-related malicious activities.


5 This tool will be released as open source soon.
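
The alerting logic described above can be sketched as follows. This is only an illustration: the function names, the threshold values, and the choice to combine the two spam indicators with OR are assumptions, and it is not the Snort plug-in code.

```python
def scan_alert(scan_rate, weighted_failed_rate,
               scan_rate_limit=50.0, failed_rate_limit=10.0):
    """OR rule: alert if either anomaly module fires (limits are placeholders)."""
    return scan_rate > scan_rate_limit or weighted_failed_rate > failed_rate_limit


def spam_alert(mx_queries, external_smtp_servers,
               mx_limit=20, smtp_limit=10):
    """Flag a host that issues many MX lookups or contacts many distinct
    external SMTP servers (combining the two with OR is an assumption)."""
    distinct_servers = len(set(external_smtp_servers))
    return mx_queries > mx_limit or distinct_servers > smtp_limit
```

In practice the limits would be tuned per network; the sketch only shows how independent indicators feed a single alert.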

It is important to note that A-plane monitoring alone 
is not sufficient for botnet detection. First of
all, these A-plane activities are not exclusively used in 
botnets. Second, because of our relatively loose design 
of A-plane monitor (for example, we will generate a 
log whenever there is a PE binary downloading in the 
network regardless of whether the binary is malicious or 
not), relying on only the logs from these activities will 
generate a lot of false positives. This is why we need to 
further perform A-plane clustering analysis as discussed 
shortly in Section 2.6. 


2.5 C-plane Clustering 


C-plane clustering is responsible for reading the logs 
generated by the C-plane monitor and finding clusters 
of machines that share similar communication patterns. 
Figure 3 shows the architecture of the C-plane clustering. 

First of all, we filter out irrelevant (or uninteresting)
traffic flows. This is done in two steps: basic-filtering
and white-listing. It is worth noting that these
two steps are not critical for the proper functioning of the
C-plane clustering module. Nonetheless, they are useful
for reducing the traffic workload and making the actual
clustering process more efficient. In the basic-filtering
step, we filter out all the flows that are not directed from
internal hosts to external hosts. Therefore, we ignore the
flows related to communications between internal hosts⁶
and flows initiated from external hosts towards internal
hosts (filter rule 1, denoted as F1). We also filter out
flows that are not completely established (filter rule 2,
denoted as F2), i.e., those flows that only contain one-way
traffic. These flows are mainly caused by scanning
activity (e.g., when a host sends SYN packets without
completing the TCP handshake). In white-list filtering,
we filter out those flows whose destinations are well-known
legitimate servers (e.g., Google, Yahoo!)
that are unlikely to host botnet C&C servers. This filter
rule is denoted as F3. In our current evaluation, the white
list is based on the US top 100 and global top 100 most
popular websites from Alexa.com.


6 If the C-plane monitor is deployed at the edge router, this traffic
will not be seen. However, if the monitor is deployed/tested in a LAN,
then this filtering can be used.


Figure 3: C-plane clustering.
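
The three filter rules can be illustrated with a short Python sketch. The internal address range, the whitelist contents (expressed here as destination addresses), and the dictionary keys are assumptions chosen for the example.

```python
import ipaddress

# Assumed values for illustration only.
INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")
WHITELIST = {"192.0.2.80"}  # stand-in for addresses of whitelisted popular sites


def is_internal(ip):
    return ipaddress.ip_address(ip) in INTERNAL_NET


def keep_flow(flow):
    """Return True if the flow survives filter rules F1, F2, and F3."""
    # F1: keep only flows directed from an internal host to an external host.
    if not is_internal(flow["src_ip"]) or is_internal(flow["dst_ip"]):
        return False
    # F2: drop flows that contain only one-way traffic.
    if flow["pkts_out"] == 0 or flow["pkts_in"] == 0:
        return False
    # F3: drop flows whose destination is on the whitelist.
    if flow["dst_ip"] in WHITELIST:
        return False
    return True


flows = [
    {"src_ip": "10.0.0.5", "dst_ip": "198.51.100.7", "pkts_out": 8, "pkts_in": 6},  # kept
    {"src_ip": "10.0.0.5", "dst_ip": "10.0.0.9", "pkts_out": 3, "pkts_in": 3},      # F1
    {"src_ip": "10.0.0.6", "dst_ip": "198.51.100.8", "pkts_out": 2, "pkts_in": 0},  # F2
    {"src_ip": "10.0.0.7", "dst_ip": "192.0.2.80", "pkts_out": 5, "pkts_in": 5},    # F3
]
kept = [f for f in flows if keep_flow(f)]
```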

After basic-filtering and white-listing, we further reduce
the traffic workload by aggregating related flows
into communication flows (C-flows) as follows. Given
an epoch $E$ (typically one day), all $m$ TCP/UDP flows
that share the same protocol (TCP or UDP), source IP,
destination IP and port, are aggregated into the same
C-flow $c_i = \{f_j\}_{j=1..m}$, where each $f_j$ is a single
TCP/UDP flow. Basically, the set $\{c_i\}_{i=1..n}$ of all the
$n$ C-flows observed during $E$ tells us “who was talking
to whom” during that epoch.
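
A minimal sketch of this aggregation, assuming each flow is a dictionary with hypothetical keys `protocol`, `src_ip`, `dst_ip`, and `dst_port`, and that all flows passed in belong to one epoch:

```python
from collections import defaultdict


def aggregate_c_flows(flows):
    """Group single flows into C-flows keyed by
    (protocol, source IP, destination IP, destination port)."""
    c_flows = defaultdict(list)
    for f in flows:
        key = (f["protocol"], f["src_ip"], f["dst_ip"], f["dst_port"])
        c_flows[key].append(f)
    return dict(c_flows)
```

Note that the source port is deliberately not part of the key, so repeated connections from the same client to the same service fall into one C-flow.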


2.5.1 Vector Representation of C-flows 


The objective of C-plane clustering is to group hosts
that share similar communication flows. This can be
accomplished by clustering the C-flows. In order to
apply clustering algorithms to C-flows we first need to
translate them into a suitable vector representation. We
extract a number of statistical features from each C-flow
$c_i$, and translate them into $d$-dimensional pattern vectors
$p_i \in \mathbb{R}^d$. We can describe this task as a projection
function $F : \text{C-plane} \to \mathbb{R}^d$. The projection function $F$
is defined as follows. Given a C-flow $c_i$, we compute the
discrete sample distribution of (currently) four random
variables:


1. the number of flows per hour (fph). fph is computed
by counting the number of TCP/UDP flows in $c_i$ that
are present for each hour of the epoch $E$.


2. the number of packets per flow (ppf). ppf is computed
by summing the total number of packets sent
within each TCP/UDP flow in $c_i$.


3. the average number of bytes per packet (bpp). For
each TCP/UDP flow $f_j \in c_i$ we divide the overall
number of bytes transferred within $f_j$ by the number
of packets sent within $f_j$.


4. the average number of bytes per second (bps). bps
is computed as the total number of bytes transferred
within each $f_j \in c_i$ divided by the duration of $f_j$.
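
The four sample lists can be computed along the following lines. This is a sketch under stated assumptions: the dictionary keys are hypothetical, a flow is attributed to the hour in which it starts, and the guards against zero packets or zero duration are choices made for the example.

```python
def c_flow_samples(flows, epoch_start, epoch_hours=24):
    """Return the raw samples of fph, ppf, bpp, and bps for one C-flow."""
    fph = [0] * epoch_hours          # one sample per hour of the epoch
    ppf, bpp, bps = [], [], []       # one sample per flow
    for f in flows:
        hour = int((f["start_time"] - epoch_start) // 3600)
        if 0 <= hour < epoch_hours:
            fph[hour] += 1
        ppf.append(f["packets"])
        bpp.append(f["bytes"] / max(f["packets"], 1))
        bps.append(f["bytes"] / max(f["duration"], 1e-6))
    return {"fph": fph, "ppf": ppf, "bpp": bpp, "bps": bps}
```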


An example of the results of this process is shown in 
Figure 4, where we select a random client from a real 
network flow log (we consider a one-day epoch) and il- 
lustrate the features extracted from its visits to Google. 

Given the discrete sample distribution of each
of these four random variables, we compute an
approximate version of it by means of a binning
technique. For example, in order to approximate the
distribution of fph we divide the x-axis into 13 intervals
$(0, k_1], (k_1, k_2], \ldots, (k_{12}, \infty)$. The values $k_1, \ldots, k_{12}$
are computed as follows. First, we compute the overall
discrete sample distribution of fph considering all the
C-flows in the traffic for an epoch $E$. Then, we compute
the quantiles⁷ $q_{5\%}$, $q_{10\%}$, $q_{15\%}$, $q_{20\%}$, $q_{25\%}$, $q_{30\%}$,
$q_{40\%}$, $q_{50\%}$, $q_{60\%}$, $q_{70\%}$, $q_{80\%}$, $q_{90\%}$ of the obtained
distribution, and we set $k_1 = q_{5\%}$, $k_2 = q_{10\%}$,
$k_3 = q_{15\%}$, etc. Now, for each C-flow we can describe
its fph (approximate) distribution as a vector of 13
elements, where each element $i$ represents the number
of times fph assumed a value within the corresponding
interval $(k_{i-1}, k_i]$. We also apply the same algorithm for
ppf, bpp, and bps, and therefore we map each C-flow
$c_i$ into a pattern vector $p_i$ of $d = 52$ elements. Figure
5 shows the scaled visiting pattern extracted from the
same C-flow shown in Figure 4.
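
The binning step can be sketched in Python as below. The nearest-rank quantile convention and all function names are assumptions; values at or below the first edge (including zero) are counted in the first bin in this sketch.

```python
from bisect import bisect_left

PERCENTS = [5, 10, 15, 20, 25, 30, 40, 50, 60, 70, 80, 90]
FEATURES = ("fph", "ppf", "bpp", "bps")


def quantile_edges(pooled_values, percents=PERCENTS):
    """Nearest-rank quantiles (the 12 bin edges) of the samples pooled over all C-flows."""
    ordered = sorted(pooled_values)
    n = len(ordered)
    # Integer ceiling of p*n/100 gives the 1-based nearest-rank position.
    return [ordered[max(1, (p * n + 99) // 100) - 1] for p in percents]


def bin_counts(values, edges):
    """Count the values falling in each of the len(edges)+1 intervals."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        # bisect_left returns the first i with edges[i] >= v, i.e. v in (k_{i-1}, k_i].
        counts[bisect_left(edges, v)] += 1
    return counts


def pattern_vector(samples, edges_by_feature):
    """Concatenate the four 13-bin histograms into one 52-element vector."""
    vec = []
    for name in FEATURES:
        vec.extend(bin_counts(samples[name], edges_by_feature[name]))
    return vec
```

Because the edges are derived from the pooled traffic of the epoch, the same edges are shared by every C-flow, which keeps the resulting vectors comparable.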


2.5.2 Two-step Clustering 


Since bots belonging to the same botnet share similar
behavior (from both the communication and activity
points of view) as we discussed before, our objective is
to look for groups of C-flows that are similar to each
other. Intuitively, pattern vectors that are close to each
other in $\mathbb{R}^d$ represent C-flows with similar communication
patterns in the C-plane. For example, suppose
two bots of the same botnet connect to two different
C&C servers (because some botnets use multiple C&C
servers). Although the connections from both bots to
the C&C servers will be in different C-flows because
of different source/destination pairs, their C&C traffic
characteristics should be similar. That is, in $\mathbb{R}^d$, these
C-flows should be found as being very similar. In order
to find groups of hosts that share similar communication
patterns, we apply clustering techniques on the dataset
$D = \{p_i = F(c_i)\}_{i=1..n}$ of the pattern vector representations
of C-flows. Clustering techniques perform
unsupervised learning. Typically, they aim at finding
meaningful groups of data points in a given feature space
$\mathcal{F}$. The definition of “meaningful clusters” is application-dependent.
Generally speaking, the goal is to group the
data into clusters that are both compact and well separated
from each other, according to a suitable similarity
metric defined in the feature space $\mathcal{F}$ [25].


7 The quantile $q_{l\%}$ of a random variable $X$ is the value $q$ for which
$P(X < q) = l\%$.


Figure 4: Visit pattern (shown in distribution) to Google from a randomly chosen normal client.

Figure 5: Scaled visit pattern (shown in distribution) to Google for the same client in Figure 4.


Clustering C-flows is a challenging task because |D|, 
the cardinality of D, is often large even for moderately 
large networks, and the dimensionality d of the feature 
space is also large. Furthermore, because the percentage 
of machines in a network that are infected by bots is 
generally small, we need to separate the few botnet- 
related C-flows from a large number of benign C-flows. 
All these make clustering of C-flows very expensive. 


In order to cope with the complexity of clustering
$D$, we solve the problem in several steps (currently in two
steps), as shown in a simple form in Figure 6. At the first
step, we perform coarse-grained clustering on a reduced
feature space $\mathbb{R}^{d'}$, with $d' < d$, using a simple (i.e.,
inexpensive) clustering algorithm (we will explain below
how we perform dimensionality reduction). The result
of this first-step clustering is a set $\{C'_i\}_{i=1..\gamma_1}$ of $\gamma_1$
relatively large clusters. By doing so we subdivide the
dataset $D$ into smaller datasets (the clusters $C'_i$) that
contain “clouds” of points that are not too far from each
other.


Afterwards, we refine this result by performing a
second-step clustering on each different dataset $C'_i$
using a simple clustering algorithm on the complete
description of the C-flows in $\mathbb{R}^d$ (i.e., we do not perform
dimensionality reduction in the second-step clustering).
This second step generates a set of $\gamma_2$ smaller and more
precise clusters $\{C''_i\}_{i=1..\gamma_2}$.


Figure 6: Two-step clustering of C-flows.

We implement the first- and second-step clustering
using the X-means clustering algorithm [31]. X-means
is an efficient algorithm based on K-means [25], a very
popular clustering algorithm. Different from K-means,
the X-means algorithm does not require the user to
choose the number $K$ of final clusters in advance.
X-means runs multiple rounds of K-means internally
and performs efficient clustering validation using the
Bayesian Information Criterion [31] in order to compute
the best value of $K$. X-means is fast and scales well
with respect to the size of the dataset [31].

For the first-step (coarse-grained) clustering, we first
reduce the dimensionality of the feature space from $d = 52$
features (see Section 2.5.1) into $d' = 8$ features by
simply computing the mean and variance of the distribution
of fph, ppf, bpp, and bps for each C-flow. Then we
apply the X-means clustering algorithm on the obtained
representation of C-flows to find the coarse-grained clusters
$\{C'_i\}_{i=1..\gamma_1}$. Since the size of the clusters $\{C'_i\}_{i=1..\gamma_1}$
generated by the first-step clustering is relatively small,
we can now afford to perform a more expensive analysis
on each $C'_i$. Thus, for the second-step clustering, we use
all the $d = 52$ available features to represent the C-flows,
and we apply the X-means clustering algorithm to refine
the results of the first-step clustering.
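
The dimensionality reduction used for the first step can be sketched as follows. The use of the population variance and the function name are assumptions; the clustering itself (X-means) is not shown, since it is not part of the Python standard library.

```python
from statistics import fmean, pvariance

FEATURES = ("fph", "ppf", "bpp", "bps")


def reduce_features(samples):
    """Map the four sample lists of one C-flow to 8 values:
    (mean, variance) of fph, ppf, bpp, and bps, in that order."""
    reduced = []
    for name in FEATURES:
        values = samples[name]
        reduced.append(fmean(values))
        reduced.append(pvariance(values))  # population variance (assumed convention)
    return reduced
```

The 8-element vectors feed the coarse-grained clustering, while the full 52-element vectors are only needed inside each coarse cluster.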

Of course, since unsupervised learning is a notoriously
difficult task, the results of this two-step clustering algorithm
may still not be perfect. As a consequence, the
C-flows related to a botnet may be grouped into some 
distinct clusters, which basically represent sub-botnets. 
Furthermore, a cluster that contains mostly botnet or 
benign C-flows may also contain some “noisy” benign 
or botnet C-flows, respectively. However, we would like 
to stress the fact that these problems are not necessarily 
critical and can be alleviated by performing correlation 
with the results of the activity-plane (A-plane) clustering 
(see Section 2.7). 

Finally, we need to note that it is possible to bootstrap 
the clustering from A-plane logs. For example, one may 
apply clustering to only those hosts that appear in the A- 
plane logs (i.e., the suspicious activity logs). This may 
greatly reduce the workload of the C-plane clustering 
module, if speed is the main concern. Similarly, one 
may bootstrap the A-plane correlation from C-plane logs, 
e.g., by monitoring only clients that previously formed 
communication clusters, or by giving monitoring pref- 
erence to those clients that demonstrate some persistent 
C-flow communications (assuming botnets are used for 
long-term purpose). 


2.6 A-plane Clustering 


In this stage, we perform two-layer clustering on activity
logs. Figure 7 shows the clustering process in
A-plane. For the whole list of clients that perform at
least one malicious activity during one day, we first
cluster them according to the types of their activities
(e.g., scan, spam, and binary downloading). This is
the first layer clustering. Then, for each activity type,
we further cluster clients according to specific activity
features (the second layer clustering). For scan activity,
features could include scanning ports, that is, two clients
could be clustered together if they are scanning the same
ports. Another candidate feature could be the target
subnet/distribution, e.g., whether the clients are scanning
the same subnet. For spam activity, two clients could be
clustered together if their SMTP connection destinations
overlap heavily. This might not be robust when
the bots are configured to use different SMTP servers
in order to evade detection. One can further consider
the spam content if the whole SMTP traffic is captured.
To cluster spam content, one may consider the similarity
of embedded URLs that are very likely to be similar
within the same botnet [43], SMTP connection frequency,
content entropy, and the normalized compression distance
(NCD [5, 41]) on the entire email bodies. For
outbound exploit activity, one can cluster two clients if
they send the same type of exploit, indicated by the Snort
alert SID. For binary downloading activity, two clients
could be clustered together if they download similar
binaries (because they download from the same URL
as indicated in the command from the botmaster). A
distance function between two binaries can be any string
distance such as DICE used in [20]⁸.


Figure 7: A-plane clustering.

In our current implementation, we cluster scanning
activities according to the destination scanning ports.
For spam activity clustering, because there are very few
hosts that show spamming activities in our monitored
network, we simply cluster hosts together if they perform
spamming (i.e., using only the first layer clustering here).
For binary downloading, we configure our binary downloading
monitor to capture only the first portion (packet)
of the binary for efficiency reasons (if necessary, we
can also capture the entire binary). We simply compare
whether these early portions of the binaries are the same
or not. In other words, currently, our A-plane clustering
implementation utilizes relatively weak cluster features.
In the future, we plan to implement clustering on the more
complex feature sets discussed above, which are more
robust against evasion. However, even with the current
weak cluster features, BotMiner already demonstrated
high accuracy with a low false positive rate as shown in
our later experiments.


8 In an extreme case in which bots update their binaries from different
URLs (and the binaries are packed to be polymorphic and thus different
from each other), one should unpack the binary using tools such as
Polyunpack [37] before calculating the distance. One may also directly
apply normalized compression distance (NCD [5,41]) on the original
(maybe packed) binaries.
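
A minimal sketch of the two-layer grouping with the simple features of the current implementation: scans are grouped by destination port, binary downloads by the captured first portion of the binary, and spam by activity type only. The log format (tuples of client, activity type, feature) and the function name are assumptions.

```python
from collections import defaultdict


def a_plane_clusters(activity_log):
    """activity_log: iterable of (client_ip, activity_type, feature) tuples.

    Layer 1 groups clients by activity type; layer 2 groups them by the
    feature value. Spam uses layer 1 only, so its feature is ignored.
    """
    clusters = defaultdict(set)
    for client, activity, feature in activity_log:
        key = (activity, None) if activity == "spam" else (activity, feature)
        clusters[key].add(client)
    return dict(clusters)
```

Each resulting cluster is a set of client addresses labeled with its activity type, which is the form needed for cross-plane correlation.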


2.7 Cross-plane Correlation 


Once we obtain the clustering results from A-plane (activity
patterns) and C-plane (communication patterns),
we perform cross-plane correlation. The idea is to cross-check
clusters in the two planes to find out intersections
that reinforce evidence of a host being part of a botnet. In
order to do this, we first compute a botnet score $s(h)$ for
each host $h$ on which we have witnessed at least one kind
of suspicious activity. We filter out the hosts that have
a score below a certain detection threshold $\theta$, and then
group the remaining most suspicious hosts according to
a similarity metric that takes into account the A-plane
and C-plane clusters these hosts have in common.

We now explain how the botnet score is computed for
each host. Let $H$ be the set of hosts reported in the output
of the A-plane clustering module, and $h \in H$. Also, let
$A^{(h)} = \{A_i\}_{i=1..m_h}$ be the set of $m_h$ A-clusters that
contain $h$, and $C^{(h)} = \{C_i\}_{i=1..n_h}$ be the set of $n_h$
C-clusters that contain $h$. We compute the botnet score for
$h$ as

where $A_i, A_j \in A^{(h)}$ and $C_k \in C^{(h)}$, $t(A_i)$ is the type of
activity cluster $A_i$ refers to (e.g., scanning or spamming),
and $w(A_i) \geq 1$ is an activity weight assigned to $A_i$.
$w(A_i)$ assigns higher values to “strong” activities (e.g.,
spam and exploit) and lower values to “weak” activities
(e.g., scanning and binary download).

$h$ will receive a high score if it has performed multiple
types of suspicious activities, and if other hosts that
were clustered with $h$ also show the same multiple types
of activities. For example, assume that $h$ performed
scanning and then attempted to exploit a machine outside
the monitored network. Let $A_1$ be the cluster of hosts
that were found to perform scanning and were grouped
with $h$ in the same cluster. Also, let $A_2$ be a cluster
related to exploit activities that includes $h$ and other
hosts that performed similar activities. A larger overlap
between $A_1$ and $A_2$ would result in a higher score being
assigned to $h$. Similarly, if $h$ belongs to A-clusters that
have a large overlap with C-clusters, then it means that
the hosts clustered together with $h$ share similar activities
as well as similar communication patterns.
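
One way to turn the description above into code is sketched below. It is an illustration only: measuring overlap as $|X \cap Y| / |X \cup Y|$ (Jaccard overlap), multiplying the weights of two A-clusters, and all function names are assumptions of this sketch rather than a statement of the exact scoring formula.

```python
from itertools import combinations


def overlap(x, y):
    """Assumed overlap measure: size of intersection over size of union."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def botnet_score(host, a_clusters, c_clusters, weight=lambda activity: 1.0):
    """a_clusters: list of (activity_type, set_of_hosts); c_clusters: list of sets."""
    mine_a = [(t, s) for t, s in a_clusters if host in s]
    mine_c = [s for s in c_clusters if host in s]
    score = 0.0
    # Pairs of A-clusters of different activity types that both contain the host.
    for (t1, s1), (t2, s2) in combinations(mine_a, 2):
        if t1 != t2:
            score += weight(t1) * weight(t2) * overlap(s1, s2)
    # Every (A-cluster, C-cluster) pair that contains the host.
    for t, s in mine_a:
        for c in mine_c:
            score += weight(t) * overlap(s, c)
    return score
```

With unit weights, a host scores above zero exactly when it appears in two A-clusters of different types or in both an A-cluster and a C-cluster.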
Given a predefined detection threshold $\theta$, we consider
all the hosts $h \in H$ with $s(h) > \theta$ as (likely) bots,
and filter out the hosts whose scores do not exceed $\theta$.
Now, let $B \subseteq H$ be the set of detected bots,
$A^{(B)} = \{A_i\}_{i=1..m_B}$ be the set of A-clusters that each contains at
least one bot $h \in B$, and $C^{(B)} = \{C_i\}_{i=1..n_B}$ be the set
of C-clusters that each contains at least one bot $h \in B$.
Also, let $K^{(B)} = A^{(B)} \cup C^{(B)} = \{K_i\}_{i=1..(m_B+n_B)}$
be an ordered union/set of A- and C-clusters. We then
describe each bot $h \in B$ as a binary vector
$b(h) \in \{0,1\}^{|K^{(B)}|}$, whereby the $i$-th element $b_i = 1$ if
$h \in K_i$, and $b_i = 0$ otherwise. Given this representation,
we can define the following similarity between bots $h_i$
and $h_j$ as

$$sim(h_i, h_j) = \sum_{k=1}^{m_B} I\left(b_k^{(i)} = b_k^{(j)}\right) + I\left(\sum_{k=m_B+1}^{m_B+n_B} I\left(b_k^{(i)} = b_k^{(j)}\right) \geq 1\right), \quad (2)$$

where we use $b^{(i)} = b(h_i)$ and $b^{(j)} = b(h_j)$, for the sake
of brevity. $I(X)$ is the indicator function, which equals
one when the boolean argument $X$ is true, and equals
zero when $X$ is false. The intuition behind this metric
is that if two hosts appear in the same activity clusters
and in at least one common C-cluster, they should be
clustered together.
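
The similarity of Equation (2) can be sketched as follows: it counts the A-cluster positions on which the two membership vectors agree and adds one if they agree on at least one C-cluster position. The function names are hypothetical, and the sketch reads "agree" literally as equality of the two entries.

```python
def bot_vector(host, a_clusters, c_clusters):
    """Binary membership vector over the ordered A-clusters followed by the C-clusters."""
    return [int(host in k) for k in list(a_clusters) + list(c_clusters)]


def similarity(b_i, b_j, m_b):
    """m_b is the number of A-clusters, i.e. the length of the A-part of each vector."""
    a_agree = sum(int(x == y) for x, y in zip(b_i[:m_b], b_j[:m_b]))
    c_agree = sum(int(x == y) for x, y in zip(b_i[m_b:], b_j[m_b:]))
    return a_agree + int(c_agree >= 1)
```

The pairwise similarities (or a distance derived from them) can then be passed to any hierarchical clustering routine.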

This definition of similarity between hosts gives us the 
opportunity to apply hierarchical clustering. This allows 
us to build a dendrogram, i.e., a tree like graph (see 
Figure 8) that encodes the relationships among the bots. 
We use the Davies-Bouldin (DB) validation index [21] 
to find the best dendrogram cut, which produces the 
most compact and well separated clusters. The obtained 
clusters group bots in (sub-) botnets. Figure 8 shows 
a (hypothetical) example. Assuming that the best cut 
suggested by the DB index is the one at height 90, 
we would obtain two botnets, namely $\{h_8, h_3, h_5\}$ and
$\{h_4, h_6, h_9, h_2, h_1, h_7\}$.

Figure 8: Example of hierarchical clustering for botnet detection.


In our current implementation, we simply set weight
$w(A_i) = 1$ for all $i$ and $\theta = 0$, which essentially
means that we will consider all hosts that appear in two
different types of A-clusters and/or in both A- and C-clusters
as suspicious candidates for further hierarchical
clustering.


3 Experiments 


To evaluate our BotMiner detection framework and pro- 
totype system, we have tested its performance on several 
real-world network traffic traces, including both (pre- 
sumably) normal data from our campus network and 
collected botnet data. 


3.1 Experiment Setup and Data Collection 


We set up traffic monitors to work on a span port mir- 
roring a backbone router at the campus network of the 
College of Computing at Georgia Tech. The traffic rate 
is typically 200Mbps-300Mbps at daytime. We ran the 
C-plane and A-plane monitors for a continuous 10-day 
period in late 2007. A random sampling of the net- 
work trace shows that the traffic is very diverse, contain- 
ing many normal application protocols, such as HTTP, 
SMTP, POP, FTP, SSH, NetBios, DNS, SNMP, IM 
(e.g., ICQ, AIM), P2P (e.g., Gnutella, Edonkey, 
bittorrent), and IRC. This serves as a good back- 
ground to test the false positives and detection perfor- 
mance on a normal network with rich application proto- 
cols. 

We have collected a total of eight different botnets 
covering IRC, HTTP and P2P. Table 1 lists the basic 
information about these traces. 

We re-used two IRC and two HTTP botnet traces
introduced in [20], i.e., V-Spybot, V-Sdbot,
B-HTTP-I, and B-HTTP-II. In short, V-Spybot
and V-Sdbot are generated by executing modified bot
code (Spybot and Sdbot [6]) in a fully controlled virtual
network. They contain four Windows XP/2K IRC
bot clients, and last several minutes. B-HTTP-I and
B-HTTP-II are generated based on the description of
Web-based C&C communications in [24,39]. Four bot
clients communicate with a controlled server and execute
the received command (e.g., spam). In B-HTTP-I,
the bot contacts the server periodically (about every
five minutes) and the whole trace lasts for about 3.6
hours. In B-HTTP-II, we have a more stealthy C&C
communication where the bot waits a random time
between zero and ten minutes each time before it visits
the server, and the whole trace lasts for 19 hours. These
four traces are renamed as Botnet-IRC-spybot,
Botnet-IRC-sdbot, Botnet-HTTP-1, and
Botnet-HTTP-2, respectively. In addition, we also
generated a new IRC botnet trace that lasts for a longer
time (a whole day) using modified Rbot [3] source code.
Again this is generated in a controlled virtual network
with four Windows clients and one IRC server. This
trace is labeled as Botnet-IRC-rbot.


We also obtained a real-world IRC-based botnet C&C
trace that was captured in the wild in 2004, labeled as
Botnet-IRC-N. The trace contains about 7 minutes of
IRC C&C communications, and has hundreds of bots
connected to the IRC C&C server. The botmaster set
the command “.scan.startall” in the TOPIC of
the channel. Thus, every bot would begin to propagate
through scanning upon joining the channel. The bots
report each successful transfer of the binary to other
machines, and also report the machines that have been
exploited. We believe this could be a variant of Phatbot [6].
Although we obtained only the IRC C&C traffic, we
hypothesize that the scanning activities are easy to detect
given that the bots are executing scanning commands in
order to propagate. Thus, we assume we have an A-plane
cluster with the botnet members, because we want to see
if we can still capture C-plane clusters and obtain
cross-plane correlation results.


Finally, we obtained a real-world trace containing two
P2P botnets, Nugache [28] and Storm [18,23]. The trace
lasts for a whole day, and there are 82 Nugache bots and
13 Storm bots in the trace. It was captured from a group
of honeypots running in the wild in late 2007. Each
instance runs in Wine (an open source implementation
of the Windows API on top of Unix/Linux) instead
of a virtual or physical machine. Such a set-up is known
as winobot [12] and is used by researchers to track
botnets. By using a lightweight emulation environment
(Wine), winobots can run hundreds or thousands of
black-box instances of a given malware sample. This allows
one to participate in a P2P botnet en masse. Nugache is a
TCP-based P2P bot that performs encrypted communications
on port 8. Storm, originating in January of 2007, is
one of the very few known UDP-based P2P bots. It
is based on the Kademlia [30] protocol and makes use
of the Overnet network [2] to locate related data (e.g.,
commands). Storm is well-known as a spam botnet with
a huge number of infected hosts [27]. In the implementation
of winobot, several malicious capabilities such as
sending spam are disabled for legal reasons; thus, we
cannot observe spam traffic in the trace. However,
we ran a full version of Storm on a VM-based honeypot
(instead of the Wine environment) and easily observed that
it kept sending a huge amount of spam traffic, which
makes the A-plane monitoring quite easy. Similarly,
when running Nugache on a VM-based honeypot, we
observed scanning activity to port 8 because it attempted
to connect to its seeding peers but failed many times
(because the peers may not be available). Thus, we
can detect and cluster A-plane activities for these P2P
botnets.


Trace              | Size | Duration | Pkt        | TCP/UDP flows | Bots | C&C
Botnet-IRC-rbot    |      |          |            |               |      |
Botnet-IRC-sdbot   |      |          |            |               |      |
Botnet-IRC-spybot  |      |          |            |               |      |
Botnet-IRC-N       |      |          |            |               |      |
Botnet-P2P-Storm   | 1.2G | 24h      | 59,322,490 | 5,495,223     | 13   | P2P
Botnet-P2P-Nugache | 1.2G | 24h      | 59,322,490 | 5,495,223     | 82   | P2P

Table 1: Collected botnet traces, covering IRC, HTTP and P2P based botnets. Storm and Nugache share the same
file, so the statistics of the whole file are reported.


Trace  | Pkts           | Flows       | Filtered by F1 | Filtered by F2 | Filtered by F3 | Flows after filtering | C-flows (TCP/UDP)
Day-1  | 5,178,375,514  | 23,407,743  | 20,727,588     |                |                |                       | 6,981 / 132,333
Day-2  | 7,131,674,165  | 29,632,407  | 27,861,853     |                |                |                       | 2 691 / 96,261
Day-3  | 9,701,255,613  | 30,192,645  | 28,491,442     |                |                | 1,163,710             | 39,744 / 94,081
Day-4  | 14,713,667,172 | 35,590,583  | 33,434,985     |                |                | 1,520,739             | 73,021 / 167,146
Day-5  | 11,177,174,133 | 56,235,380  | 52,795,168     |                |                | 2,076,721             | 57,664 / 167,175
Day-6  | 9,950,803,423  | 75,037,684  | 71,397,138     |                |                | 2,124,044             | 59,383 / 176,210
Day-7  | 10,039,871,506 | 109,549,192 | 105,530,316    |                |                | 2,348,030             | 55,023 / 150,211
Day-8  | 11,174,937,812 | 96,364,123  | 92,413,010     |                |                | 2,312,130             | 56,246 / 179,838
Day-9  | 9,504,436,063  | 62,550,060  | 56,516,281     |                |                | 2,839,553             | 25,557 / 164,986
Day-10 | 11,071,701,564 | 83,433,368  | 77,601,188     | 2,964,948      |                | 2,839,395             | 25,436 / 154,294

Table 2: C-plane traffic statistics, basic results of filtering, and C-flows.


3.2 Evaluation Results 


Table 2 lists the statistics for the 10 days of network
data we used to validate our detection system. For
each day there are around 5-10 billion packets (TCP
and UDP) and 30-100 million flows. Table 2 shows
the results of several steps of filtering. The first step of
filtering (filter rule F1) is the most effective
filter in terms of data volume reduction. F1 filters out
those flows that are not initiated from internal hosts to
external hosts, and achieves about 90% data volume
reduction. This is because most of the flows are within
the campus network (i.e., they are initiated from internal
hosts towards other internal hosts). F2 further filters
out around 0.5-3 million flows that are not completely
established. F3 further reduces the data volume by filtering
out another 30,000 flows. After applying all three
steps of filtering, there are around 1 to 3 million flows
left per day. We converted these remaining flows into
C-flows as described in Section 2.5, and obtained around
40,000 TCP C-flows and 130,000 UDP C-flows per day.
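
The first two filtering rules can be sketched as simple predicates over flow records. This is a minimal illustration under stated assumptions, not the prototype's code: the `Flow` record, its field names, and the `INTERNAL_NET` prefix are hypothetical, and F3 is omitted because its rule is not restated in this section.

```python
import ipaddress
from dataclasses import dataclass

# Hypothetical prefix standing in for the monitored campus network.
INTERNAL_NET = ipaddress.ip_network("10.0.0.0/8")


@dataclass
class Flow:
    src: str           # address of the host that initiated the flow
    dst: str           # address of the contacted host
    established: bool  # True if the connection was completely established


def is_internal(addr: str) -> bool:
    return ipaddress.ip_address(addr) in INTERNAL_NET


def passes_f1(flow: Flow) -> bool:
    # F1 keeps only flows initiated from an internal host to an external host.
    return is_internal(flow.src) and not is_internal(flow.dst)


def passes_f2(flow: Flow) -> bool:
    # F2 drops flows that are not completely established.
    return flow.established


def filter_flows(flows):
    return [f for f in flows if passes_f1(f) and passes_f2(f)]
```

Applying F1 first mirrors the order in Table 2: since most campus flows are internal-to-internal, the cheapest test removes the bulk of the data before any further processing.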

We then performed two-step clustering on C-flows as 
described in Section 2.5. Table 3 shows the clustering 
results and false positives (number of clusters that are 
not botnets). The results for the first 5 days are related to 
both TCP and UDP traffic, whereas in the last 5 days we 
focused on only TCP traffic. 

It is easy to see from Table 3 that there are thousands 
of C-clusters generated each day. In addition, there 
are several thousand activity logs generated from A- 
plane monitors. Since we use relatively weak monitor 
modules, it is not surprising that we have this many ac- 
tivity logs. Many logs report binary downloading events 
or scanning activities. We cluster these activity logs
according to their activity features. As explained earlier,
we are interested in groups of machines that perform 
activities in a similar/coordinated way. Therefore, we 
filter out the A-clusters that contain only one host. This 
simple filtering rule allows us to obtain a small number 
of A-clusters and reduce the overall false positive rate of 
our botnet detection system. 


Afterwards, we apply cross-plane correlation. We 
assume that the traffic we collected from our campus 
network is normal. In order to verify this assumption 
we used state-of-the-art botnet detection techniques like 
BotHunter [19] and BotSniffer [20]. Therefore, any 
cluster generated as a result of the cross-plane correlation 
is considered as a false positive cluster. It is easy to see 
from Table 3 that there are very few such false positive 
clusters every day (from zero to four). Most of these 
clusters contain only two clients (i.e., they induce two 
false positives). In three out of ten days no false positive 
was reported. In both Day-2 and Day-3, the cross- 
correlation produced one false positive cluster containing 
two hosts. Two false positive clusters were reported in 
each day from Day-5 to Day-8. In Day-4, the cross-plane 
correlation produced four false positive clusters. 


For each day of traffic, the last column of Table 3 
shows the false positive rate (FP rate), which is calcu- 
lated as the fraction of IP addresses reported in the false 
positive clusters over the total number of distinct normal 
clients appearing in that day. After further analysis we 
found that many of these false positives are caused by 
clients performing binary downloading from websites 
not present in our whitelist. In practice, the number
of false positives may be reduced by implementing a
better binary downloading monitor and clustering module,
e.g., by capturing the entire binary and performing
content inspection (using either anomaly-based detection
systems [38] or signature-based AV tools).


Trace           | Step-1 C-clusters | Step-2 C-clusters | A-plane logs | A-clusters | False Positive Clusters | FP Rate
Day-1 (TCP/UDP) |                   |                   |              |            | 0                       | 0 (0/878)
Day-2 (TCP/UDP) |                   |                   |              |            | 1                       | 0.003 (2/638)
Day-3 (TCP/UDP) |                   |                   |              |            | 1                       | 0.003 (2/692)
Day-4 (TCP/UDP) |                   |                   |              |            | 4                       | 0.01 (9/871)
Day-5 (TCP/UDP) |                   |                   |              |            | 2                       | 0.0048 (4/838)
Day-6 (TCP)     |                   |                   |              |            | 2                       | 0.008 (7/877)
Day-7 (TCP)     |                   |                   |              |            | 2                       | 0.006 (5/835)
Day-8 (TCP)     |                   |                   |              |            | 2                       | 0.0091 (8/877)
Day-9 (TCP)     |                   |                   |              |            | 0                       | 0 (0/714)
Day-10 (TCP)    |                   |                   |              |            | 0                       | 0 (0/689)

Table 3: Clustering results and false positives.


Botnet     | Bots | Detected? | Clustered bots | Detection rate
IRC-rbot   |      | YES       |                |
IRC-sdbot  |      | YES       |                |
IRC-spybot |      | YES       |                |
IRC-N      | 259  | YES       | 258            | 99.6%
HTTP-1     |      | YES       |                | 100%
P2P-Storm  | 13   | YES       | 13             | 100%

Table 4: Botnet detection results using BotMiner.
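
The false positive rate defined above is a plain ratio of flagged normal clients to all distinct normal clients seen that day. A minimal sketch follows; the function name and the host identifiers are illustrative assumptions.

```python
def fp_rate(fp_clusters, normal_clients):
    """Fraction of distinct normal clients that appear in any false positive cluster.

    fp_clusters: iterable of sets of client identifiers
    normal_clients: set of all distinct normal clients seen that day
    """
    flagged = set().union(*fp_clusters) & set(normal_clients)
    return len(flagged) / len(normal_clients)


# One false positive cluster with two hosts out of 638 clients,
# as in the "0.003 (2/638)" entry reported for Day-2.
clients = {f"host{i}" for i in range(638)}
day2_rate = fp_rate([{"host0", "host1"}], clients)
```

Because the rate counts hosts rather than clusters, a day with four small clusters can still have a rate near 1%.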


In order to validate the detection accuracy of BotMiner,
we overlaid botnet traffic onto normal traffic. We
consider one botnet trace at a time and overlay it onto
the entire normal traffic trace of Day-2. We simulate a
near-realistic scenario by constructing the test dataset as
follows. Let n be the number of distinct bots in the botnet
trace we want to overlay onto normal traffic. We randomly
select n distinct IP addresses from the normal traffic trace
and map them to the n IP addresses of the bots. That is,
we replace an IP_i of a normal machine with the IP_i of
a bot. In this way, we obtain a dataset of mixed normal
and botnet traffic where a set of n machines show both
normal and botnet-related behavior. Table 4 reports the
detection results for each botnet.
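
The overlay procedure can be sketched as an address mapping followed by a merge of the two traces. The sketch below is an assumption-laden illustration: flows are reduced to (source, destination) pairs, the function names are hypothetical, and a seeded random generator is used only to make the selection repeatable.

```python
import random


def build_ip_mapping(normal_ips, bot_ips, seed=0):
    """Map n randomly chosen normal addresses to the n bot addresses."""
    rng = random.Random(seed)
    bots = sorted(set(bot_ips))
    # Sample n distinct normal addresses; raises ValueError if there are too few.
    chosen = rng.sample(sorted(set(normal_ips)), len(bots))
    return dict(zip(chosen, bots))


def overlay(normal_flows, bot_flows, mapping):
    """Rewrite the chosen normal addresses, then merge both traces."""
    rewritten = [(mapping.get(src, src), mapping.get(dst, dst))
                 for src, dst in normal_flows]
    return rewritten + list(bot_flows)
```

After the merge, each mapped address carries both the normal traffic of the original machine and the botnet traffic of the bot it now represents.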


Table 4 shows that BotMiner is able to detect all eight 
botnets. We verified whether the members in the reported 
clusters are actually bots or not. For 6 out of 8 botnets, 
we obtained 100% detection rate, i.e., we successfully 
identified all the bots within the 6 botnets. For example,
in the case of P2P botnets (Botnet-P2P-Nugache
and Botnet-P2P-Storm), BotMiner correctly
generated a cluster containing all the botnet members.
In the case of Botnet-IRC-spybot, BotMiner 
correctly detected a cluster of bots. However, one of 
the bots belonging to the botnet was not reported in 
the cluster, which means that the detector generated 
a false negative. Botnet-IRC-N contains 259 bot 
clients. BotMiner was able to identify 258 of the bots in 
one cluster, whereas one of the bots was not detected. 
Therefore, in this case BotMiner had a detection rate of 
99.6%.

There were some cases in which BotMiner also generated
a false positive cluster containing two normal hosts.
We verified that these two normal hosts in particular were 
also responsible for the false positives generated during 
the analysis of the Day-2 normal traffic (see Table 3). 

As we can see, BotMiner performs quite well in our 
experiments, showing a very high detection rate with rel- 
atively few false positives in real-world network traces. 


4 Limitations and Potential Solutions 


Like any intrusion/anomaly detection system, BotMiner 
is not perfect or complete. It is likely that once ad- 
versaries know our detection framework and implemen- 
tation, they might find some ways to evade detection, 
e.g., by evading the C-plane and A-plane monitoring 
and clustering, or the cross-plane correlation analysis. 
We now address these limitations and discuss possible 
solutions. 


4.1 Evading C-plane Monitoring and Clustering 


Botnets may try to utilize a legitimate website (e.g.,
Google) for their C&C purpose in an attempt to evade
detection. Evasion would be successful in this case if
we whitelisted such legitimate websites to reduce the
volume of monitored traffic and improve the efficiency of
our detection system. However, if a legitimate website,
say Google, is used as a means to locate a secondary
URL for actual command hosting or binary downloading,
botnets may not be able to hide this secondary
URL and the corresponding communications. Therefore,
clustering of network traffic towards the server pointed
to by this secondary URL will likely allow us to detect the
bots. Also, whitelisting is just an optional operation. One
may simply choose not to use whitelisting to avoid this
kind of evasion attempt (of course, in this case one
faces a trade-off between accuracy and efficiency).

Botnet members may attempt to intentionally manip- 
ulate their communication patterns to evade our C-plane 
clustering. The easiest thing is to switch to multiple C&C 
servers. However, this does not help much to evade our 
detection because such peer communications could still 
be clustered together just like how we cluster P2P com- 
munications. A more advanced way is to randomize each 
individual communication pattern, for example by ran- 
domizing the number of packets per flow (e.g., by inject- 
ing random packets in a flow), and the number of bytes 
per packet (e.g., by padding random bytes in a packet). 
However, such randomization may introduce similarities 
among botnet members if we measure the distribution 
and entropy of communication features. Also, this ran- 
domization may raise suspicion because normal user 
communications may not have such randomized patterns. 
Advanced evasion may be attempted by bots that try
to mimic the communication patterns of normal hosts,
in a way similar to polymorphic blending attacks [15].
Furthermore, bots could use covert channels [1] to hide
their actual C&C communications. We acknowledge
that, generally speaking, communication randomization,
mimicry attacks and covert channels represent limitations
for all traffic-based detection approaches, including
BotMiner’s C-plane clustering technique. By incorporating
more detection features such as content inspection and
host level analysis, the detection system may make
evasion more difficult.
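
The remark that randomization may itself introduce similarities can be made concrete with a small sketch that measures the Shannon entropy of a per-host feature, such as packet sizes. This is not part of the BotMiner prototype; the choice of feature and the function name are assumptions for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A host that always sends the same packet size has zero entropy,
# while a host that pads every packet to a different size has high entropy.
fixed_sizes = [100] * 8
padded_sizes = [64, 128, 256, 512]
```

Under this view, bots that all pad randomly would share an unusually high entropy value, which could serve as one more feature on which they cluster together.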

Finally, we note that if botnets are used to perform 
multiple tasks (in A-plane), we may still detect them 
even when they can evade C-plane monitoring and anal- 
ysis. By using the scoring algorithm described in Section 
2.7, we can perform cross clustering analysis among 
multiple activity clusters (in A-plane) to accumulate the 
suspicious score needed to claim the existence of botnets.
Thus, we may not even require C-plane analysis if
there is already a strong cross-cluster correlation among
different types of malicious activities in A-plane. For
example, if the same set of hosts is involved in several types
of A-plane clusters (e.g., they send spam, scan others,
and/or download the same binaries), they can be reported
as botnets because those behaviors, by themselves, are
highly suspicious and most likely indicate botnet
behavior [19, 20].


4.2 Evading A-plane Monitoring and Clustering 


Malicious activities of botnets are unlikely or relatively 
hard to change as long as the botmaster wants the botnets 
to perform “useful” tasks. However, the botmaster can 
attempt to evade BotMiner’s A-plane monitoring and 
clustering in several ways. 

Botnets may perform very stealthy malicious activities
in order to evade the detection of A-plane monitors. For
example, they can scan very slowly (e.g., send one scan
per hour) or send spam very slowly (e.g., send one spam
message per day). This will evade our monitor sensors.
However, this also puts a limit on the utility of bots.

In addition, as discussed above, if the botmaster commands
each bot randomly and individually to perform
a different task, the bots are not different from previous
generations of isolated, individual malware instances.
This is unlikely to be the way a botnet is used in practice. A
more advanced evasion is to differentiate the bots and 
avoid commanding bots in the same monitored network 
the same way. This will cause additional effort and 
inconvenience for the botmaster. To defeat such an eva- 
sion, we can deploy distributed monitors on the Internet 
to cover a larger monitored space. 

Note that if the botmaster takes the extreme action of
randomizing/individualizing both the C&C communications
and attack activities of each bot, then these bots
are probably not part of a botnet according to our specific
definition, because the bots are not performing
similar/coordinated commanded activities. Orthogonal to
horizontal correlation approaches such as BotMiner
for detecting a botnet, we can always use complementary
systems like BotHunter [19], which examine the behavior
history of each distinct host in a dialog or vertical correlation
based approach, to detect individual bots.


4.3 Evading Cross-plane Analysis 


A botmaster can command the bots to perform an extremely
delayed task (e.g., delayed for days after receiving
commands). Thus, the malicious activities and
C&C communications occur on different days. If we use
only one day’s data, we may not be able to obtain
cross-plane clusters. As a solution, we may use data from
multiple days and cross-check back several days. Although
this may capture these botnets, it may also
generate more false positives. Clearly,
there is a trade-off. The botmaster also faces a trade-off,
because a very slow C&C essentially impedes the
efficiency in controlling/coordinating the bot army. Also,
a bot-infected machine may be disconnected from the
Internet or be powered off by the users during the delay
and become unavailable to the botmaster.
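
The multiple-day cross-check can be sketched as an intersection between the hosts in one day's A-clusters and the hosts in C-clusters from a window of preceding days. This is a simplified illustration, not the prototype's correlation step: it works on flat host sets rather than clusters, and the `lookback` parameter and input layout are assumptions.

```python
def cross_plane_hosts(a_hosts_by_day, c_hosts_by_day, day, lookback=0):
    """Hosts with A-plane activity on `day` and C-plane clustering
    on `day` or any of the `lookback` preceding days.

    Both inputs map an integer day index to a set of hosts.
    """
    a_hosts = a_hosts_by_day.get(day, set())
    c_hosts = set()
    for d in range(day - lookback, day + 1):
        c_hosts |= c_hosts_by_day.get(d, set())
    return a_hosts & c_hosts
```

A larger `lookback` catches bots whose activity is delayed by days, but it also widens the set of C-plane hosts and so admits more coincidental matches, which is the false positive trade-off noted above.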

In summary, while it is possible that a botmaster can 
find a way to exploit the limitations of BotMiner, the 
convenience or the efficiency of botnet C&C and the 
utility of the botnet also suffer. Thus, we believe that 
our protocol- and structure-independent detection frame- 
work represents a significant advance in botnet detec- 
tion. 


5 Related Work 


To collect and analyze bots, researchers widely utilize 
honeypot techniques [4, 16, 32]. Freiling et al. [16] used
honeypots to study the problem of botnets. Nepenthes [4]
is a special honeypot tool for automatic malware sam- 
ple collection. Rajab et al. [32] provided an in-depth 
measurement study of the current botnet activities by 
conducting a multi-faceted approach to collect bots and 
track botnets. Cooke et al. [10] conducted several basic 
studies of botnet dynamics. In [13], Dagon et al. pro- 
posed to use DNS sinkholing technique for botnet study 
and pointed out the global diurnal behavior of botnets. 
Barford and Yegneswaran [6] provided a detailed study 
on the code base of several common bot families. Collins 
et al. [9] presented their observation of a relationship 
between botnets and scanning/spamming activities. 


Several recent papers proposed different approaches to 
detect botnets. Ramachandran et al. [34] proposed using 
DNSBL counter-intelligence to find botnet members that 
generate spam. This approach is useful only for certain
types of spam botnets. In [35], Reiter and Yen proposed
a system TAMD to detect malware (including botnets) 
by aggregating traffic that shares the same external des- 
tination, similar payload, and that involves internal hosts 
with similar OS platforms. TAMD’s aggregation method 
based on destination networks focuses on networks that 
experience an increase in traffic as compared to a histor- 
ical baseline. Different from BotMiner that focuses on 
botnet detection, TAMD aims to detect a broader range 
of malware. Since TAMD’s aggregation features are 
different from BotMiner’s (in which we cluster similar 
communication patterns and similar malicious activity 
patterns), TAMD and BotMiner can complement each 
other in botnet and malware detection. Livadas et al. 
[29,40] proposed a machine learning based approach for 
botnet detection using some general network-level traffic 
features of chat-like protocols such as IRC. Karasaridis 
et al. [26] studied network flow level detection of IRC 
botnet controllers for backbone networks. The above two 
are similar to our work in C-plane clustering but different 
in many ways. First, they are used to detect IRC-based
botnets (by matching a known IRC traffic profile), while
we do not have the assumption of known C&C protocol 
profiles. Second, we use a different feature set on a 
new communication flow (C-flow) data format instead 
of traditional network flow. Third, we consider both 
C-plane and A-plane information instead of just flow 
records. 


Rishi [17] is a signature-based IRC botnet detection
system that matches known IRC bot nickname patterns.
Binkley and Singh [7] proposed combining IRC statistics 
and TCP work weight for the detection of IRC-based 
botnets. In [19], we described BotHunter, which is a 
passive bot detection system that uses dialog correla- 
tion to associate IDS events to a user-defined bot infec- 
tion dialog model. Different from BotHunter’s dialog 
correlation or vertical correlation that mainly examines
the behavior history associated with each distinct host,
BotMiner utilizes a horizontal correlation approach that 
examines correlation across multiple hosts. BotSniffer 
[20] is an anomaly-based botnet C&C detection system 
that also utilizes horizontal correlation. However, it 
is used mainly for detecting centralized C&C activities 
(e.g., IRC and HTTP). 

The aforementioned systems are mostly limited to 
specific botnet protocols and structures, and many of 
them work only on IRC-based botnets. BotMiner is a 
novel general detection system that does not have such 
limitations and can greatly complement existing detec- 
tion approaches. 


6 Conclusion & Future Work 


Botnet detection is a challenging problem. In this pa- 
per, we proposed a novel network anomaly-based botnet 
detection system that is independent of the protocol and 
structure used by botnets. Our system exploits the essential
definition and properties of botnets, i.e., bots within
the same botnet will exhibit similar C&C communication
patterns and similar malicious activity patterns. In
our experimental evaluation on many real-world network 
traces, BotMiner shows excellent detection accuracy on 
various types of botnets (including IRC-based, HTTP- 
based, and P2P-based botnets) with a very low false 
positive rate on normal traffic. 

It is likely that future botnets (especially P2P botnets) 
may utilize evasion techniques to avoid detection, as 
discussed in Section 4. In our future work, we will 
study new techniques to monitor/cluster communication 
and activity patterns of botnets, and these techniques 
are intended to be more robust to evasion attempts. In 
addition, we plan to further improve the efficiency of the 
C-flow converting and clustering algorithms, combine 
different correlation techniques (e.g., vertical correlation 
and horizontal correlation), and develop new real-time 
detection systems based on a layered design using sam- 
pling techniques to work in very high speed and very 
large network environments.


Abstract


The abuse of chat services by automated programs,
known as chat bots, poses a serious threat to Internet
users. Chat bots target popular chat networks to distribute
spam and malware. In this paper, we first conduct
a series of measurements on a large commercial
chat network. Our measurements capture a total of 14
different types of chat bots ranging from simple to advanced.
Moreover, we observe that human behavior is
more complex than bot behavior. Based on the measurement
study, we propose a classification system to accurately
distinguish chat bots from human users. The
proposed classification system consists of two components:
(1) an entropy-based classifier and (2) a
machine-learning-based classifier. The two classifiers
complement each other in chat bot detection. The entropy-based
classifier is more accurate in detecting unknown chat bots,
whereas the machine-learning-based classifier is faster
at detecting known chat bots. Our experimental evaluation
shows that the proposed classification system is highly
effective in differentiating bots from humans.


1 Introduction 


Internet chat is a popular application that enables real- 
time text-based communication. Millions of people 
around the world use Internet chat to exchange messages 
and discuss a broad range of topics on-line. Internet 
chat is also a unique networked application, because of 
its human-to-human interaction and low bandwidth con- 
sumption [9]. However, the large user base and open na- 
ture of Internet chat make it an ideal target for malicious 
exploitation. 

The abuse of chat services by automated programs, 
known as chat bots, poses a serious threat to on-line 
users. Chat bots have been found on a number of chat 
systems, including commercial chat networks, such as 
AOL [15,29], Yahoo! [19, 25, 26, 28, 34] and MSN [16],
and open chat networks, such as IRC and Jabber. There
are also reports of bots in some non-chat systems with 
chat features, including online games, such as World of 
Warcraft [7,32] and Second Life [27]. Chat bots exploit 
these on-line systems to send spam, spread malware, and 
mount phishing attacks. 


So far, the efforts to combat chat bots have focused 
on two different approaches: (1) keyword-based filtering 
and (2) human interactive proofs. The keyword-based 
message filters, used by third party chat clients [42, 43], 
suffer from high false negative rates because bot mak- 
ers frequently update chat bots to evade published key- 
word lists. The use of human interactive proofs, such as 
CAPTCHAs [1], is also ineffective because bot opera- 
tors assist chat bots in passing the tests to log into chat 
rooms [25,26]. In August 2007, Yahoo! implemented 
CAPTCHA to block bots from entering chat rooms, but 
bots are still able to enter chat rooms in large numbers. 
There are online petitions against both AOL and Ya- 
hoo! [28, 29], requesting that the chat service providers 
address the growing bot problem. While on-line systems
are besieged with chat bots, no systematic investigation
of chat bots has been conducted. An effective detection
system against chat bots is in great demand but still
missing.

In this paper, we first perform a series of measurements
on a large commercial chat network, Yahoo! chat,
to study the behaviors of chat bots and humans in on-line
chat systems. Our measurements capture a total of
14 different types of chat bots. The different types of
chat bots use different triggering mechanisms and text
obfuscation techniques. The former determines message
timing, and the latter determines message content. Our
measurements also reveal that human behavior is more
complex than bot behavior, which motivates the use of
entropy rate, a measure of complexity, for chat bot
classification. Based on the measurement study, we propose
a classification system to accurately distinguish chat bots
from humans. There are two main components in our
classification system: (1) an entropy classifier and (2) a
machine-learning classifier. Based on the characteristics
of message time and size, the entropy classifier measures
the complexity of chat flows and then classifies them as
bots or humans. In contrast, the machine-learning classifier
is mainly based on message content for detection.
The two classifiers complement each other in chat bot
detection. While the entropy classifier requires more messages
for detection and, thus, is slower, it is more accurate
in detecting unknown chat bots. Moreover, the entropy
classifier helps train the machine-learning classifier.
The machine-learning classifier requires fewer messages
for detection and, thus, is faster, but cannot detect
most unknown bots. By combining the entropy classifier
and the machine-learning classifier, the proposed
classification system is highly effective at capturing chat bots,
in terms of accuracy and speed. We conduct experimental
tests on the classification system, and the results validate
its efficacy in chat bot detection.

The remainder of this paper is structured as follows. 
Section 2 covers background on chat bots and related 
work. Section 3 details our measurements of chat bots 
and humans. Section 4 describes our chat bot classifica- 
tion system. Section 5 evaluates the effectiveness of our 
approach for chat bot detection. Finally, Section 6 con- 
cludes the paper and discusses directions for our future 
work. 


2 Background and Related Work 


2.1 Chat Systems 


Internet chat is a real-time communication tool that al- 
lows on-line users to communicate via text in virtual 
spaces, called chat rooms or channels. There are a num- 
ber of protocols that support chat [17], including IRC, 
Jabber/XMPP, MSN/WLM (Microsoft), OSCAR (AOL), 
and YCHT/YMSG (Yahoo!). The users connect to a chat 
server via chat clients that support a certain chat protocol, 
and they may browse and join many chat rooms featuring 
a variety of topics. The chat server relays chat messages 
to and from on-line users. A chat service with a large 
user base might employ multiple chat servers. In addi- 
tion, there are several multi-protocol chat clients, such as 
Pidgin (formerly GAIM) and Trillian, that allow a user 
to join different chat systems. 

Although IRC has existed for a long time, it has not 
gained mainstream popularity. This is mainly because 
its console-like interface and command-line-based oper- 
ation are not user-friendly. The recent chat systems im- 
prove user experience by using graphic-based interfaces, 
as well as adding attractive features such as avatars, 
emoticons, and audio-video communication capabilities. 
Our study is carried out on the Yahoo! chat network, one
of the largest and most popular commercial chat systems.

Yahoo! chat uses proprietary protocols, in which the 
chat messages are transmitted in plain-text, while com- 
mands, status and other meta data are transmitted as en- 
coded binary data. Unlike those on most IRC networks, 
users on the Yahoo! chat network cannot create chat 
rooms with customized topics because this feature is dis- 
abled by Yahoo! to prevent abuses [24]. In addition, 
users on Yahoo! chat are required to pass a CAPTCHA 
word verification test in order to join a chat room. This 
recently-added feature is to guard against a major source 
of abuse—bots. 


2.2 Chat Bots 


The term bot, short for robot, refers to automated pro- 
grams, that is, programs that do not require a human 
operator. A chat bot is a program that interacts with a 
chat service to automate tasks for a human, e.g., creating 
chat logs. The first-generation chat bots were designed to 
help operate chat rooms, or to entertain chat users, e.g., 
quiz or quote bots. However, with the commercializa- 
tion of the Internet, the main enterprise of chat bots is 
now sending chat spam. Chat bots deliver spam URLs 
via either links in chat messages or user profile links. A 
single bot operator, controlling a few hundred chat bots, 
can distribute spam links to thousands of users in differ- 
ent chat rooms, making chat bots very profitable to the 
bot operator who is paid per-click through affiliate pro- 
grams. Other potential abuses of bots include spreading 
malware, phishing, booting, and similar malicious activ- 
ities. 

A few countermeasures have been used to defend 
against the abuse of chat bots, though none of them are 
very effective. On the server side, CAPTCHA tests are 
used by Yahoo! chat in an effort to prevent chat bots 
from joining chat rooms. However, this defense becomes 
ineffective as chat bots bypass CAPTCHA tests with human 
assistance. We have observed that bots continue 
to join chat rooms and sometimes even become the ma- 
jority members of a chat room after the deployment of 
CAPTCHA tests. Third-party chat clients filter out chat 
bots, mainly based on key words or key phrases that are 
known to be used by chat bots. The drawback with this 
approach is that it cannot capture those unknown or eva- 
sive chat bots that do not use the known key words or 
phrases. 


2.3 Related Work 


Dewes et al. [9] conducted a systematic measurement 
study of IRC and Web-chat traffic, revealing several sta- 
tistical properties of chat traffic. (1) Chat sessions tend to 
last for a long time, and a significant number of IRC sessions 
last much longer than Web-chat sessions. (2) Chat 
session inter-arrival time follows an exponential distribu- 
tion, while the distribution of message inter-arrival time 
is not exponential. (3) In terms of message size, all chat 
sessions are dominated by a large number of small pack- 
ets. (4) Over an entire session, typically a user receives 
about 10 times as much data as he sends. However, very 
active users in Web-chat and automated scripts used in 
IRC may send more data than they receive. 

There is considerable overlap between chat and instant 
messaging (IM) systems, in terms of protocol and user 
base. Many widely used chat systems such as IRC predate 
the rise of IM systems, and have had great impact on 
IM system and protocol design. In return, some new 
features that make the IM systems more user-friendly 
have been back-ported to the chat systems. For exam- 
ple, IRC, a classic chat system, implements a number of 
IM-like features, such as presence and file transfers, in 
its current versions. Some messaging service providers, 
such as Yahoo!, offer both chat and IM access to their 
end-user clients. With this in mind, we outline some re- 
lated work on IM systems. Liu et al. [21] explored client- 
side and server-side methods for detecting and filtering 
IM spam or spim. However, their evaluation is based on a 
corpus of short e-mail spam messages, due to the lack of 
data on spim. In [23], Mannan et al. studied IM worms, 
automated malware that spreads on IM systems using the 
IM contact list. Leveraging the spreading characteristics 
of IM malware, Xie et al. [41] presented an IM malware 
detection and suppression system based on the honeypot 
concept. 

Botnets consist of a large number of slave computing 
assets, which are also called “bots”. However, the us- 
age and behavior of bots in botnets are quite different 
from those of chat bots. The bots in botnets are mali- 
cious programs designed specifically to run on compro- 
mised hosts on the Internet, and they are used as plat- 
forms to launch a variety of illicit and criminal activities 
such as credential theft, phishing, distributed denial-of- 
service attacks, etc. In contrast, chat bots are automated 
programs designed mainly to interact with chat users by 
sending spam messages and URLs in chat rooms. Al- 
though having been used by botnets as command and 
control mechanisms [2, 11], IRC and other chat systems 
do not play an irreplaceable role in botnets. In fact, due 
to the increasing focus on detecting and thwarting IRC- 
based botnets [8, 13, 14], recently emerged botnets, such 
as Phatbot, Nugache, Slapper, and Sinit, show a tendency 
towards using P2P-based control architectures [39]. 

Chat spam shares some similarities with email spam. 
Like email spam, chat spam contains advertisements of 
illegal services and counterfeit goods, and solicits hu- 
man users to click spam URLs. Chat bots employ many 
text obfuscation techniques used by email spam such 
as word padding and synonym substitution. Since the 
detection of email spam can be easily converted into 
the problem of text classification, many content-based 
filters utilize machine-learning algorithms for filtering 
email spam. Among them, Bayesian-based statistical ap- 
proaches [6, 12, 20, 44, 45] have achieved high accuracy 
and performance. Although very successful, Bayesian- 
based spam detection techniques still can be evaded by 
carefully crafted messages [18, 22, 40]. 


3 Measurement 


In this section, we detail our measurements on Yahoo! 
chat, one of the most popular commercial chat services. 
The focus of our measurements is on public messages 
posted to Yahoo! chat rooms. The logging of chat mes- 
sages is available on the standard Yahoo! chat client, as 
well as most third party chat clients. Upon entering chat, 
all chat users are shown a disclaimer from Yahoo! that 
other users can log their messages. However, we con- 
sider the contents of the chat logs to be sensitive, so we 
only present fully-anonymized statistics. 

Our data was collected between August and November 
of 2007. In late August, Yahoo! implemented a 
CAPTCHA check on entering chat rooms [5, 26], creating 
technical problems that made their chat rooms unstable 
for about two weeks [3, 4]. At the same time, Yahoo! 
implemented a protocol update, preventing most 
third party chat clients, used by a large proportion of 
Yahoo! chat users, from accessing the chat rooms. In 
short, these upgrades made the chat rooms difficult to 
access for both chat bots and humans. In mid to 
late September, both chat bot and third party client de- 
velopers updated their programs. By early October, chat 
bots were found in Yahoo! chat [25], possibly bypass- 
ing the CAPTCHA check with human assistance. Due 
to these problems and the lack of chat bots in September 
and early October, we perform our analysis on August 
and November chat logs. In August and November, we 
collected a total of 1,440 hours of chat logs. There are 
147 individual chat logs from 21 different chat rooms. 
The process of reading and labeling these chat logs required 
about 100 hours. To the best of our knowledge, 
this is the first large-scale measurement and classification 
of chat bots. 


3.1 Log-Based Classification 


In order to characterize the behavior of human users and 
that of chat bots, we need two sets of chat logs pre- 
labeled as bots and humans. To create such datasets, we 
perform log-based classification by reading and labeling 
a large number of chat logs. The chat users are labeled 
in three categories: human, bot, and ambiguous. 

The log-based classification process is a variation of 
the Turing test. In a standard Turing test [37], the exam- 
iner converses with a test subject (a possible machine) for 
five minutes, and then decides if the subject is a human 
or a machine. In our classification process, the examiner 
observes a long conversation between a test subject (a 
possible chat bot) and one or more third parties, and then 
decides if the subject is a human or a chat bot. In addition, 
our examiner checks the content of URLs and typically 
observes multiple instances of the same chat bot, 
which further improves our classification accuracy. Moreover, 
given that even the best current artificial intelligence 
systems [36] can rarely pass a non-restricted Turing test, 
our classification of chat bots should be very accurate. 

Although a Turing test is subjective, we outline a few 
important criteria. The main criterion for being labeled 
as human is a high proportion of specific, intelligent, 
and human-like responses to other users. In general, if a 
user’s responses suggest more advanced intelligence than 
current state-of-the-art AI [36], then the user can be la- 
beled as human. The ambiguous label is reserved for 
non-English, incoherent, or non-communicative users. 
The criteria for being classified as bot are as follows. The 
first is the lack of the intelligent responses required for 
the human label. The second is the repetition of similar 
phrases either over time or from other users (other in- 
stances of the same chat bot). The third is the presence 
of spam or malware URLs in messages or in the user’s 
profile. 


3.2 Analysis 


In total, our measurements capture 14 different types of 
chat bots. The different types of chat bots are deter- 
mined by their triggering mechanisms and text obfusca- 
tion schemes. The former relates to message timing, and 
the latter relates to message content. The two main types 
of triggering mechanisms observed in our measurements 
are timer-based and response-based. A timer-based bot 
sends messages based on a timer, which can be peri- 
odic (i.e., fixed time intervals) or random (i.e., variable 
time intervals). A response-based bot sends messages 
based on programmed responses to specific content in 
messages posted by other users. 

There are many different kinds of text obfuscation 
schemes. The purpose of text obfuscation is to vary the 
content of messages and make bots more difficult to rec- 
ognize or appear more human-like. We observed four ba- 
sic text obfuscation methods that chat bots use to evade 
filtering or detection. First, chat bots introduce random 
characters or spaces into their messages, similar to some 
spam e-mails. Second, chat bots use various synonym 
phrases to avoid obvious keywords. By this method, a 
template with several synonyms for multiple words can 
lead to thousands of possible messages. Third, chat bots 
use short messages or break up long messages into mul- 
tiple messages to evade message filters that work on a 
message-by-message basis. Fourth, and most interest- 
ingly, chat bots replay human phrases entered by other 
chat users. 
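
The effect of synonym substitution on message variety can be illustrated with a minimal sketch. The template below is hypothetical (it is not taken from the chat logs), as are the names `TEMPLATE`, `count_variants`, and `compose`; the point is only that the number of distinct messages is the product of the number of synonyms per slot, so even a short template yields over a thousand variants.

```python
import math
import random

# Hypothetical template: each slot lists interchangeable synonym phrases.
TEMPLATE = [
    ["hi", "hello", "hey", "yo", "sup", "greetings"],
    ["all", "everyone", "room", "guys", "folks"],
    ["see", "view", "check", "visit", "open", "read"],
    ["my profile", "my page", "my info", "my pics", "my site", "my link", "my blog"],
]


def count_variants(template):
    """Number of distinct messages: the product of the slot sizes."""
    return math.prod(len(slot) for slot in template)


def compose(template, rng):
    """Pick one phrase per slot and join them into a single message."""
    return " ".join(rng.choice(slot) for slot in template)


rng = random.Random(7)          # fixed seed so the sketch is repeatable
sample = compose(TEMPLATE, rng)
total = count_variants(TEMPLATE)
```

Adding one more slot with six synonyms would multiply the total by six, which is how a single template reaches thousands of possible messages.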

According to our observation, the main activity of chat 
bots is to send spam links to chat users. There are two 
approaches that chat bots use to distribute spam links in 
chat rooms. The first is to post a message with a spam 
link directly in the chat room. The second is to enter the 
spam URL in the chat bot’s user profile and then con- 
vince the users to view the profile and click the link. Our 
logs also include some examples of malware spreading 
via chat rooms. The behavior of malware-spreading chat 
bots is very similar to that of spam-sending chat bots, 
as both attempt to lure human users to click links. Al- 
though we did not perform detailed malware analysis on 
links posted in the chat rooms and Yahoo! applies filters 
to block links to known malicious files, we found several 
worm instances in our data. Twelve W32.Imaut.AS 
[35] worms appeared in the August chat logs, and 23 
W32.Imaut.AS worms appeared in the November chat 
logs. The November worms attempted to send malicious 
links but were blocked by Yahoo! (the malicious links 
in their messages were removed). The August 
worms, however, were able to send out malicious links. 

The focus of our measurements is mainly on short 
term statistics, as these statistics are most likely to be 
useful in chat bot classification. The two key measure- 
ment metrics in this study are inter-message delay and 
message size. Based on these two metrics, we profile the 
behavior of humans and that of chat bots. We further 
divide chat bots into four different groups: 
periodic bots, random bots, responder bots, and replay 
bots. With respect to these short-term statistics, humans 
and chat bots behave differently, as shown below. 


3.2.1 Humans 


Figure 1 shows the probability distributions of human 
inter-message delay and message size. Since the behav- 
ior of humans is persistent, we only draw the probabil- 
ity mass function (pmf) curves based on the August data. 
The previous study on Internet chat systems [9] observed 
that the distribution of inter-message delay in chat systems 
was heavy-tailed. In general, our measurement result 
conforms to that observation. The body of the pmf 
curve in Figure 1 (a) (log-log scale) can be linearly fitted, 
indicating that the distribution of human inter-message 
delays follows a power law. In other words, the distribution 
is heavy-tailed. We also find that the pmf curve 
of human message size in Figure 1 (b) can be well fitted 
by an exponential distribution with λ = 0.034 after 
excluding the initial spike. 

Figure 1: Distribution of human inter-message delay (a) and message size (b) 
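
A straight line on log-log axes corresponds to a pmf of the form p(x) = C · x^(-α), since log p = log C - α log x. The following is a minimal sketch of such a linear fit using ordinary least squares; the function name `loglog_slope` and the synthetic data are assumptions for illustration and do not reproduce the fit to the measured data.

```python
import math


def loglog_slope(xs, ps):
    """Least-squares slope and intercept of log(p) versus log(x)."""
    lx = [math.log(x) for x in xs]
    lp = [math.log(p) for p in ps]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_p = sum(lp) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxp = sum((a - mean_x) * (b - mean_p) for a, b in zip(lx, lp))
    slope = sxp / sxx
    return slope, mean_p - slope * mean_x


# Synthetic pmf body that follows p(x) = 0.5 * x^(-1.5) exactly.
delays = [1, 2, 4, 8, 16, 32, 64]
pmf = [0.5 * d ** -1.5 for d in delays]
slope, intercept = loglog_slope(delays, pmf)
```

The negated slope estimates the power-law exponent α, and exp(intercept) estimates the constant C.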


3.2.2 Periodic Bots 


A periodic bot posts messages mainly at regular time in- 
tervals. The delay periods of periodic bots, especially 
those bots that use long delays, may vary by several sec- 
onds. The variation of delay period may be attributed to 
either transmission delay caused by network traffic con- 
gestion or chat server delay, or message emission delay 
incurred by system overloading on the bot hosting ma- 
chine. The posting of periodic messages is a simple but 
effective mechanism for distributing messages, so it is 
not surprising that a substantial portion of chat bots use 
periodic timers. 

We display the probability distributions of inter-message 
delay and message size for periodic bots in Figure 2. 
We use ‘+’ for displaying August data and ‘•’ 
for November data. The distributions of periodic bots 
are distinct from those of humans shown in Figure 1. 
The distribution of inter-message delay for periodic bots 
clearly manifests the timer-triggering characteristic of 
periodic bots. There are three clusters with high proba- 
bilities at time ranges [30-50], [100-110], and [150-170]. 
These clusters correspond to the November periodic bots 
with timer values around 40 seconds and the August peri- 
odic bots with timer values around 105 and 160 seconds, 
respectively. The message size pmf curve of the August 
periodic bots shows an interesting bell shape, much like a 
normal distribution. After examining message contents, 
we find that the bell shape may be attributed to the mes- 
sage composition method some August bots used. As 
shown in Appendix A, some August periodic bots com- 
pose a message using a single template. The template 
has several parts and each part is associated with several 
synonym phrases. Since the length of each part is independent 
and identically distributed, the length of the whole 
message, i.e., the sum of all parts, should approximate a 
normal distribution. The November bots employ a simi- 
lar composition method, but use several templates of dif- 
ferent lengths. Thus, the message size distribution of the 
November periodic bots reflects the distribution of the 
lengths of the different templates, with the length of each 
individual template approximating a normal distribution. 
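
The normal approximation follows from the central limit theorem. As a sketch, assume a template with k parts whose lengths ℓ_1, ..., ℓ_k are i.i.d. with mean μ and variance σ² (these symbols are introduced here for illustration and are not values measured from the logs). The message length L then satisfies:

```latex
L = \sum_{i=1}^{k} \ell_i, \qquad
\mathbb{E}[L] = k\mu, \qquad
\operatorname{Var}(L) = k\sigma^2, \qquad
\frac{L - k\mu}{\sigma\sqrt{k}} \xrightarrow{d} \mathcal{N}(0, 1)
\quad (k \to \infty).
```

With only a handful of parts per template the approximation is rough, which is consistent with a bell shape that is "much like" rather than exactly a normal distribution.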


3.2.3 Random Bots 


A random bot posts messages at random time intervals. 
The random bots in our data used different random distri- 
butions, some discrete and others continuous, to generate 
inter-message delays. The use of random timers makes 
random bots appear more human-like than periodic bots. 
In statistical terms, however, random bots exhibit quite 
different inter-message delay distributions than humans. 

Figure 3 depicts the probability distributions of inter- 
message delay and message size for random bots. Com- 
pared to periodic bots, random bots have more dispersed 
timer values. In addition, the August random bots have 
a large overlap with the November random bots. The 
points with high probabilities (greater than 10~) in the 
time range [30-90] in Figure 3 (a) represent the August 
and November random bots that use a discrete distribu- 
tion of 40, 64, and 88 seconds. The wide November 
cluster with medium probabilities in the time range [40- 
130] is created by the November random bots that use a 
uniform distribution between 45 and 125 seconds. The 
probabilities of different message sizes for the August 
and November random bots are mainly in the size range 
[0-50]. Unlike periodic bots, most random bots do not 
use templates or synonym replacement, but directly repeat 
messages. Thus, as their messages are selected from 
a database at random, the message size distribution reflects 
the proportion of messages of different sizes in the 
database. 

Figure 2: Distribution of periodic bot inter-message delay (a) and message size (b) 

Figure 3: Distribution of random bot inter-message delay (a) and message size (b) 
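
The timer-based triggering mechanisms described for periodic and random bots can be sketched as simple delay generators. This is an illustrative reconstruction, not code from any observed bot: the function names and the jitter value of 3 seconds are assumptions, while the timer values (105 seconds, the discrete set 40/64/88, and the uniform range 45 to 125) are taken from the text above.

```python
import random


def periodic_delays(period, jitter, n, rng):
    """Fixed timer plus a small delay variation (network or host load)."""
    return [period + rng.uniform(-jitter, jitter) for _ in range(n)]


def discrete_random_delays(choices, n, rng):
    """Random timer drawn from a small discrete set of values."""
    return [rng.choice(choices) for _ in range(n)]


def uniform_random_delays(low, high, n, rng):
    """Random timer drawn uniformly from a continuous range."""
    return [rng.uniform(low, high) for _ in range(n)]


rng = random.Random(1)  # fixed seed so the sketch is repeatable
periodic = periodic_delays(105, 3, 500, rng)          # jitter of 3 s is assumed
discrete = discrete_random_delays([40, 64, 88], 500, rng)
uniform = uniform_random_delays(45, 125, 500, rng)
```

A histogram of `periodic` shows one narrow cluster, `discrete` shows three isolated points, and `uniform` shows a wide, flat cluster, mirroring the shapes described for Figures 2 (a) and 3 (a).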


3.2.4 Responder Bots 


A responder bot sends messages based on the content 
of messages in the chat room. For example, a message 
ending with a question mark may trigger a responder bot 
to send a vague response with a URL, as shown in Ap- 
pendix A. The vague response, in the context, may trick 
human users into believing that the responder is a human 
and further clicking the link. Moreover, the message trig- 
gering mechanism makes responder bots look more like 
humans in terms of timing statistics than periodic or ran- 
dom bots. 

To gain more insights into responder bots, we man- 
aged to obtain a configuration file for a typical responder 
bot [38]. There are a number of parameters for making 
the responder bot mimic humans. The bot can be configured 
with a fixed typing rate, so that responses with different 
lengths take different amounts of time to “type.” The bot can 
also be set to either ignore triggers while simulating typ- 
ing, or rate-limit responses. In addition, responses can 
be assigned with probabilities, so that the responder bot 
responds to a given trigger in a random manner. 
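
The configuration parameters described above can be illustrated with a minimal sketch. The class `ResponderBot`, its parameter names, the question-mark trigger, and the example response are all hypothetical; the actual configuration format of the bot in [38] is not reproduced here. The sketch models a fixed typing rate, weighted random responses, and ignoring triggers while "typing."

```python
import random


class ResponderBot:
    def __init__(self, responses, chars_per_second=5.0, seed=0):
        # responses: list of (text, weight) pairs; weights act as probabilities.
        self.responses = responses
        self.chars_per_second = chars_per_second
        self.busy_until = 0.0
        self.rng = random.Random(seed)

    def on_message(self, now, text):
        """Return (send_time, reply), or None if the message is ignored."""
        if not text.rstrip().endswith("?"):
            return None              # not a trigger
        if now < self.busy_until:
            return None              # still simulating typing
        texts = [t for t, _ in self.responses]
        weights = [w for _, w in self.responses]
        reply = self.rng.choices(texts, weights=weights)[0]
        # Longer replies take proportionally longer to "type".
        send_time = now + len(reply) / self.chars_per_second
        self.busy_until = send_time
        return send_time, reply


bot = ResponderBot([("not sure, maybe look here", 1.0)], chars_per_second=5.0)
first = bot.on_message(0.0, "anyone here?")
ignored = bot.on_message(2.0, "really?")
later = bot.on_message(6.0, "what?")
```

Because each reply is delayed until after a human message arrives, the bot's inter-message delays inherit part of the human timing pattern, which is the property examined in Figure 4 (a).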


Figure 4 shows the probability distributions of inter- 
message delay and message size for responder bots. Note 
that only the distribution of the August responder bots is 
shown due to the small number of responder bots found 
in November. Since the message emission of responder 
bots is triggered by human messages, theoretically 
the distribution of inter-message delays of responder bots 
should demonstrate certain similarity to that of humans. 
Figure 4 (a) confirms this hypothesis. Like Figure 1 (a), 
the pmf of responder bots (excluding the head part) in 
log-log scale exhibits a clear sign of a heavy tail. But 
unlike human messages, the sizes of responder bot messages 
vary in a much narrower range (between 1 and 
160). The bell shape of the distribution for message sizes 
less than 100 indicates that responder bots share a similar 
message composition technique with periodic bots, and 
their messages are composed as templates with multiple 
parts, as shown in Appendix A. 

Figure 4: Distribution of responder bot inter-message delay (a) and message size (b) 

Figure 5: Distribution of replay bot inter-message delay (a) and message size (b) 


3.2.5 Replay Bots 


A replay bot not only sends its own messages, but also 
repeats messages from other users to appear more like a 
human user. In our experience, replayed phrases are re- 
lated to the same topic but do not appear in the same chat 
room as the original ones. Therefore, replayed phrases 
are either taken from other chat rooms on the same topic 
or saved previously in a database and replayed. 


The use of replayed phrases in a crowded or “noisy” 
chat room does, in fact, make replay bots look more like 
humans to inattentive users. The replayed phrases are 
sometimes nonsensical in the context of the chat, but 
human users tend to naturally ignore such statements. 
When replay bots succeed in fooling human users, these 
users are more likely to click links posted by the bots 
or visit their profiles. Interestingly, replay bots sometimes 
replay phrases uttered by other chat bots, making 
them very easy to recognize. The use of replay is 
potentially effective in thwarting detection methods, as 
detection tests must deal with a combination of human 
and bot phrases. By using human phrases, replay bots 
can easily defeat keyword-based message filters that filter 
message-by-message, as the human phrases should 
not be filtered out. 

Figure 6: Classification System Diagram 


Figure 5 illustrates the probability distributions of 
inter-message delay and message size for replay bots. In 
terms of inter-message delay, a replay bot is just a varia- 
tion of a periodic bot, which is demonstrated by the high 
spike in Figure 5 (a). By using human phrases, replay 
bots successfully mimic human users in terms of mes- 
sage size distribution. The message size distribution of 
replay bots in Figure 5 (b) largely resembles that of human 
users, and can be fitted by an exponential distribution 
with λ = 0.028. 


4 Classification System 


This section describes the design of our chat bot classi- 
fication system. The two main components of our clas- 
sification system are the entropy classifier and the ma- 
chine learning classifier. The basic structure of our chat 
bot classification system is shown in Figure 6. The two 
classifiers, entropy and machine learning, operate con- 
currently to process input and make classification deci- 
sions, while the machine learning classifier relies on the 
entropy classifier to build the bot corpus. The entropy 
classifier uses entropy and corrected conditional entropy 
to score chat users and then classifies them as chat bots or 
humans. The main task of the entropy classifier is to cap- 
ture new chat bots and add them to the chat bot corpus. 
The human corpus can be taken from a database of clean 
chat logs or created by manual log-based classification, 
as described in Section 3. The machine learning classi- 
fier uses the bot and human corpora to learn text patterns 
of bots and humans, and then it can quickly classify chat 
bots based on these patterns. The two classifiers are de- 
tailed as follows. 


4.1 Entropy Classifier 


The entropy classifier makes classification decisions 
based on entropy and entropy rate measures of message 
sizes and inter-message delays for chat users. If either 
the entropy or entropy rate is low for these characteris- 
tics, it indicates the regular or predictable behavior of a 
likely chat bot. If both the entropy and entropy rate are 
high for these characteristics, it indicates the irregular or 
unpredictable behavior of a possible human. 

To use entropy measures for classification, we set a 
cutoff score for each entropy measure. If a test score is 
greater than or equal to the cutoff score, the chat user is 
classified as a human. If the test score is less than the 
cutoff score, the chat user is classified as a chat bot. The 
specific cutoff score is an important parameter in deter- 
mining the false positive and true positive rates of the en- 
tropy classifier. On the one hand, if the cutoff score is too 
high, then too many humans will be misclassified as bots. 
On the other hand, if the cutoff score is too low, then too 
many chat bots will be misclassified as humans. Due to 
the importance of achieving a low false positive rate, we 
select the cutoff scores based on human entropy scores to 
achieve a targeted false positive rate. The specific cutoff 
scores and targeted false positive rates are described in 
Section 5. 


4.1.1 Entropy Measures 


The entropy rate, which is the average entropy per ran- 
dom variable, can be used as a measure of complexity or 
regularity [10, 30, 31]. The entropy rate is defined as the 
conditional entropy of a sequence of infinite length. The 
entropy rate is upper-bounded by the entropy of the first- 
order probability density function or first-order entropy. 
An independent and identically distributed (i.i.d.) process 
has an entropy rate equal to its first-order entropy. A 
highly complex process has a high entropy rate, while a 
highly regular process has a low entropy rate. 

A random process $X = \{X_i\}$ is defined as an indexed 
sequence of random variables. To give the definition of 
the entropy rate of a random process, we first define the 
entropy of a sequence of random variables as: 

$$H(X_1, \ldots, X_m) = -\sum_{x_1, \ldots, x_m} P(x_1, \ldots, x_m) \log P(x_1, \ldots, x_m),$$

where $P(x_1, \ldots, x_m)$ is the joint probability 
$P(X_1 = x_1, \ldots, X_m = x_m)$. 

Then, from the entropy of a sequence of random variables, 
we define the conditional entropy of a random 
variable given a previous sequence of random variables 
as: 

$$H(X_m \mid X_1, \ldots, X_{m-1}) = H(X_1, \ldots, X_m) - H(X_1, \ldots, X_{m-1}).$$

Lastly, the entropy rate of a random process is defined 
as: 

$$\bar{H}(X) = \lim_{m \to \infty} H(X_m \mid X_1, \ldots, X_{m-1}).$$

Since the entropy rate is the conditional entropy of a 
sequence of infinite length, it cannot be measured for finite 
samples. Thus, we estimate the entropy rate with 
the conditional entropy of finite samples. In practice, 
we replace probability density functions with empirical 
probability density functions based on the method of 
histograms. The data is binned in Q bins of approxi- 
mately equal probability. The empirical probability den- 
sity functions are determined by the proportions of bin 
number sequences in the data, i.e., the proportion of a 
sequence is the probability of that sequence. The esti- 
mates of the entropy and conditional entropy, based on 
empirical probability density functions, are represented 
as EN and CE, respectively. 

There is a problem with the estimation of 
$\mathrm{CE}(X_m \mid X_1, \ldots, X_{m-1})$ for some values of $m$. The conditional 
entropy tends to zero as $m$ increases, due to limited data. 
If a specific sequence of length $m - 1$ is found only once 
in the data, then the extension of this sequence to length 
$m$ will also be found only once. Therefore, the length $m$ 
sequence can be predicted by the length $m - 1$ sequence, 
and the length $m$ and $m - 1$ sequences cancel out. If 
no sequence of length $m$ is repeated in the data, then 
$\mathrm{CE}(X_m \mid X_1, \ldots, X_{m-1})$ is zero, even for i.i.d. 
processes. 

To solve the problem of limited data, without fixing 
the length of $m$, we use the corrected conditional entropy 
[30], represented as CCE. The corrected conditional 
entropy is defined as: 

$$\mathrm{CCE}(X_m \mid X_1, \ldots, X_{m-1}) = \mathrm{CE}(X_m \mid X_1, \ldots, X_{m-1}) + \mathrm{perc}(X_m) \cdot \mathrm{EN}(X_1),$$

where $\mathrm{perc}(X_m)$ is the percentage of unique sequences 
of length $m$ and $\mathrm{EN}(X_1)$ is the entropy with $m$ fixed at 
1, or the first-order entropy. 


The estimate of the entropy rate is the minimum of 
the corrected conditional entropy over different values of 
m. The minimum of the corrected conditional entropy 
is considered to be the best estimate of the entropy rate 
from the available data. 
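
The estimation procedure can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names, the quantile-based binning rule, the natural logarithm, and the upper limit `max_m` are all choices made here. EN is computed from the proportions of overlapping bin-number sequences, CE(m) is taken as EN(m) - EN(m-1), and perc(m) is the fraction of length-m sequences that occur exactly once.

```python
import math
from bisect import bisect_right
from collections import Counter


def equal_probability_bins(data, q):
    """Assign each value a bin index in 0..q-1 using empirical quantile edges."""
    ordered = sorted(data)
    n = len(ordered)
    edges = [ordered[(k * n) // q] for k in range(1, q)]
    return [bisect_right(edges, x) for x in data]


def pattern_stats(symbols, m):
    """Return (EN, perc) for the overlapping patterns of length m."""
    patterns = [tuple(symbols[i:i + m]) for i in range(len(symbols) - m + 1)]
    total = len(patterns)
    counts = Counter(patterns)
    entropy = -sum(c / total * math.log(c / total) for c in counts.values())
    perc = sum(1 for c in counts.values() if c == 1) / total
    return entropy, perc


def entropy_rate_estimate(data, q=5, max_m=6):
    """Minimum corrected conditional entropy (in nats) over m = 1..max_m."""
    symbols = equal_probability_bins(data, q)
    first_order, _ = pattern_stats(symbols, 1)
    best, previous = math.inf, 0.0
    for m in range(1, min(max_m, len(symbols)) + 1):
        entropy, perc = pattern_stats(symbols, m)
        cce = (entropy - previous) + perc * first_order  # CE(m) + perc * EN(1)
        best = min(best, cce)
        previous = entropy
    return best
```

A perfectly periodic sequence of delays yields an estimate of zero, while i.i.d. delays yield an estimate close to the first-order entropy (about log 5 for Q = 5), which is the separation the entropy classifier relies on.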


4.2 Machine Learning Classifier 


The machine learning classifier uses the content of chat 
messages to identify chat bots. Since chat messages (including 
emoticons) are text, the identification of chat 
bots fits naturally into the domain of machine 
learning text classification. Within the machine learning 
paradigm, the text classification problem can be formalized 
as $f : T \times C \to \{0, 1\}$, where $f$ is the classifier, 
$T = \{t_1, t_2, \ldots, t_n\}$ is the set of texts to be classified, and 
$C = \{c_1, c_2, \ldots, c_k\}$ is the set of pre-defined classes [33]. 
Value 1 for $f(t_i, c_j)$ indicates that text $t_i$ is in class $c_j$ 
and value 0 indicates the opposite decision. There are 
many techniques that can be used for text classification, 
such as naive Bayes, support vector machines, and deci- 
sion trees. Among them, Bayesian classifiers have been 
very successful in text classification, particularly in email 
spam detection. Due to the similarity between chat spam 
and email spam, we choose Bayesian classification for 
our machine learning classifier for detecting chat bots. 
We leave the study of the applicability of other types of 
machine learning classifiers to future work. 


Within the framework of Bayesian classification, identifying 
whether chat message $M$ is issued by a bot or a human 
is achieved by computing the probability of $M$ 
being from a bot given the message content, i.e., 
$P(C = \mathrm{bot} \mid M)$. If the probability is equal to or greater 
than a pre-defined threshold, then message $M$ is classified 
as a bot message. According to Bayes' theorem, 

$$P(C = \mathrm{bot} \mid M) = \frac{P(M \mid \mathrm{bot})\,P(\mathrm{bot})}{P(M \mid \mathrm{bot})\,P(\mathrm{bot}) + P(M \mid \mathrm{human})\,P(\mathrm{human})}.$$

A message $M$ is described by its feature vector 
$(f_1, f_2, \ldots, f_n)$. A feature $f$ is a single word or a combination 
of multiple words in the message. To simplify 
computation, in practice it is usually assumed that all features 
are conditionally independent of each other for 
the given category. Thus, we have 

$$P(\mathrm{bot} \mid M) = \frac{P(\mathrm{bot}) \prod_{i=1}^{n} P(f_i \mid \mathrm{bot})}{P(\mathrm{bot}) \prod_{i=1}^{n} P(f_i \mid \mathrm{bot}) + P(\mathrm{human}) \prod_{i=1}^{n} P(f_i \mid \mathrm{human})}.$$

The value of $P(\mathrm{bot} \mid M)$ may vary in different implementations 
(see [12, 45] for implementation details) of 
Bayesian classification due to differences in assumptions 
and simplifications. 

Table 1: Message Composition of Chat Bot and Human Datasets 
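
The posterior under the conditional independence assumption can be sketched as a small naive Bayes computation. This is one possible simplification for illustration, not the classifier used in the experiments: single words serve as features, add-one (Laplace) smoothing is an assumption introduced here to avoid zero probabilities, and the function names and toy training messages are hypothetical.

```python
import math
from collections import Counter


def train(messages):
    """Count word occurrences for one class; returns (counts, total words)."""
    counts = Counter(word for msg in messages for word in msg.lower().split())
    return counts, sum(counts.values())


def posterior_bot(message, bot_model, human_model, prior_bot=0.5):
    """P(bot | M) with word features and add-one smoothing (an assumption)."""
    bot_counts, bot_total = bot_model
    human_counts, human_total = human_model
    vocab = len(set(bot_counts) | set(human_counts))
    log_bot = math.log(prior_bot)
    log_human = math.log(1.0 - prior_bot)
    for word in message.lower().split():
        log_bot += math.log((bot_counts[word] + 1) / (bot_total + vocab))
        log_human += math.log((human_counts[word] + 1) / (human_total + vocab))
    diff = log_human - log_bot
    if diff > 700:               # guard against overflow in exp()
        return 0.0
    # Equivalent to the ratio form of Bayes' theorem, computed in log space.
    return 1.0 / (1.0 + math.exp(diff))


# Toy corpora, invented for illustration only.
bot_model = train(["check my profile for free pics", "free pics click my profile"])
human_model = train(["how are you today", "did you watch the game today"])
p_spam = posterior_bot("free pics in my profile", bot_model, human_model)
p_chat = posterior_bot("how was the game", bot_model, human_model)
```

Working in log space avoids numerical underflow when a message contains many features, which is one of the implementation differences that can change the computed value in practice.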

Given the abundance of implementations of Bayesian 
classification, we directly adopt one implementation, 
namely CRM114 [44], as our machine learning classification 
component. CRM114 is a powerful text classification 
system that has achieved very high accuracy 
in email spam identification. The default classifier of 
CRM114, OSB (Orthogonal Sparse Bigram), is a type 
of Bayesian classifier. Unlike common Bayesian 
classifiers, which treat individual words as features, OSB 
uses word pairs as features. OSB first chops the 
whole input into multiple basic units with five consecutive 
words in each unit. Then, it extracts four word 
pairs from each unit to construct features, and derives 
their probabilities. Finally, OSB applies Bayes' theorem 
to compute the overall probability that the text belongs 
to one class or another. 
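
The word-pair extraction can be illustrated with a short sketch. It assumes whitespace tokenization, a window that slides one word at a time, and pairs formed between the first word of each five-word unit and each of the following four words, with the distance kept as part of the feature. The function name `osb_features` is hypothetical, and the real system's tokenization and feature storage are not reproduced here.

```python
def osb_features(text, window=5):
    """Return (first_word, gap, other_word) features for each full window."""
    tokens = text.split()
    features = []
    # Only full windows are used, so each unit yields window - 1 pairs.
    for start in range(len(tokens) - window + 1):
        head = tokens[start]
        for gap in range(1, window):
            features.append((head, gap, tokens[start + gap]))
    return features


pairs = osb_features("check out my new profile")
```

Keeping the gap in the feature distinguishes "check out" (adjacent) from "check ... profile" (three words apart), so word order and spacing contribute to the classification rather than word presence alone.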


5 Experimental Evaluation 


In this section, we evaluate the effectiveness of our pro- 
posed classification system. Our classification tests are 
based on chat logs collected from the Yahoo! chat sys- 
tem. We test the two classifiers, entropy-based and 
machine-learning-based, against chat bots from August 
and November datasets. The machine learning classi- 
fier is tested with fully-supervised training and entropy- 
classifier-based training. The accuracy of classification 
is measured in terms of false positive and false nega- 
tive rates. The false positives are those human users that 
are misclassified as chat bots, while the false negatives 
are those chat bots that are misclassified as human users. 
The speed of classification is mainly determined by the 
minimum number of messages that are required for accu- 
rate classification. In general, a high number means slow 
classification, whereas a low number means fast classifi- 
cation. 


5.1 Experimental Setup 


The chat logs used in our experiments are mainly in three 
datasets: (1) human chat logs from August 2007, (2) bot 
chat logs from August 2007, and (3) bot chat logs from 
November 2007. In total, these chat logs contain 342,696 
human messages and 87,049 bot messages. In our exper- 
iments, we use the first half of each chat log, human and 
bot, for training our classifiers and the second half for 
testing our classifiers. The composition of the chat logs 
for the three datasets is listed in Table 1. 

The entropy classifier only requires a human training
set. We use the human training set to determine the cutoff
scores, which are used by the entropy classifier to decide
whether a test sample is a human or bot. The target false
positive rate is set at 0.01. To achieve this false positive
rate, the cutoff scores are set at approximately the 1st
percentile of human training set scores. Then, samples
that score higher than the cutoff are classified as humans,
while samples that score lower than the cutoff are classified
as bots. The entropy classifier uses two entropy
tests: entropy and corrected conditional entropy. The entropy
test estimates first-order entropy, and the corrected
conditional entropy test estimates higher-order entropy or
entropy rate. The corrected conditional entropy test is more
precise with coarse-grain bins, whereas the entropy test
is more accurate with fine-grain bins [10]. Therefore,
we use Q = 5 for the corrected conditional entropy test
and Q = 256 with m fixed at 1 for the entropy test.
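The two entropy tests and the percentile cutoff can be sketched as below. This is a hedged illustration rather than the authors' implementation: equal-width binning, the `max_m` limit, the nearest-rank percentile, and all function names are assumptions. The corrected conditional entropy follows the commonly stated form, the minimum over pattern length m of the conditional entropy plus the fraction of patterns seen only once times the first-order entropy.

```python
import math
from collections import Counter


def to_bins(values, q, lo, hi):
    """Map each value to one of q equal-width bins over [lo, hi] (hi > lo)."""
    width = (hi - lo) / q
    return [min(q - 1, max(0, int((v - lo) / width))) for v in values]


def entropy(patterns):
    """Shannon entropy in bits of the empirical distribution of patterns."""
    n = len(patterns)
    return -sum(c / n * math.log2(c / n) for c in Counter(patterns).values())


def corrected_conditional_entropy(symbols, max_m=5):
    """Minimum over m of CE(m) + perc(m) * EN(1), for m = 2..max_m."""
    en1 = entropy(symbols)
    best = en1
    for m in range(2, max_m + 1):
        pats = [tuple(symbols[i:i + m]) for i in range(len(symbols) - m + 1)]
        if not pats:
            break
        # Conditional entropy: joint entropy minus entropy of the prefixes.
        prefixes = [p[:-1] for p in pats]
        ce = entropy(pats) - entropy(prefixes)
        # Fraction of length-m patterns that occur exactly once.
        perc = sum(1 for c in Counter(pats).values() if c == 1) / len(pats)
        best = min(best, ce + perc * en1)
    return best


def cutoff_score(human_scores, percentile=1.0):
    """Nearest-rank percentile of the human training scores."""
    ordered = sorted(human_scores)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def classify(score, cutoff):
    """Below the cutoff is a bot; ties are treated as human (an assumption)."""
    return "bot" if score < cutoff else "human"
```

A strictly periodic symbol sequence scores near zero on the corrected conditional entropy, which is the regularity the test is meant to expose.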

We run classification tests for each bot type using 
the entropy classifier and machine learning classifier. 
The machine learning classifier is tested based on fully- 
supervised training and then entropy-based training. In 
fully-supervised training, the machine learning classifier 
is trained with manually labeled data, as described in 
Section 3. In entropy-based training, the machine learn- 
ing classifier is trained with data labeled by the entropy 
classifier. For each evaluation, the entropy classifier uses 
samples of 100 messages, while the machine learning 
classifier uses samples of 25 messages. 
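The sampling and entropy-based labeling steps can be sketched as follows. The helper names `chunk` and `entropy_label`, the `score_fn` callback, and the decision to drop a trailing partial sample are assumptions made for illustration; the text does not say how leftover messages are handled.

```python
def chunk(messages, size):
    """Split a message list into consecutive samples of exactly `size` messages.

    Any trailing messages that do not fill a whole sample are dropped.
    """
    return [messages[i:i + size] for i in range(0, len(messages) - size + 1, size)]


def entropy_label(samples, score_fn, cutoff):
    """Label each sample with the entropy classifier's decision.

    `score_fn` is a hypothetical callable returning an entropy score for a
    sample; samples scoring below `cutoff` are labeled as bots.
    """
    return [(s, "bot" if score_fn(s) < cutoff else "human") for s in samples]


if __name__ == "__main__":
    log = [f"msg {i}" for i in range(250)]
    print(len(chunk(log, 100)), "entropy samples")
    print(len(chunk(log, 25)), "machine learning samples")
```

The labeled pairs returned by `entropy_label` stand in for the training data that the machine learning classifier would receive in entropy-based training.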


5.2 Experimental Results 


We now present the results for the entropy classifier and 
machine learning classifier. The four chat bot types are: 
periodic, random, responder, and replay. The classifica- 
tion tests are organized by chat bot type, and are ordered 
by increasing detection difficulty. 

Table 2: Entropy Classifier Accuracy

|      | AUG. BOTS                     | NOV. BOTS                  | HUMANS |
| test | periodic | random | responder | periodic | random | replay | human  |





5.2.1 Entropy Classifier 


The detection results of the entropy classifier are listed 
in Table 2, which includes the results of the entropy test 
(EN) and corrected conditional entropy test (CCE) for 
inter-message delay (imd) and message size (ms). The
overall results for all entropy-based tests are shown in 
the final row of the table. The true positives are the total 
unique bot samples correctly classified as bots. The false 
positives are the total unique human samples mistakenly 
classified as bots. 

Periodic Bots: As the simplest group of bots, periodic 
bots are the easiest to detect. They use different fixed 
timers and repeatedly post messages at regular intervals. 
Therefore, their inter-message delays are concentrated in 
a narrower range than those of humans, resulting in lower 
entropy than that of humans. The inter-message delay 
EN and CCE tests detect 100% of all periodic bots in 
both August and November datasets. The message size 
EN and CCE tests detect 76% and 63% of the Au- 
gust periodic bots, respectively, and 90% and 100% of 
the November periodic bots, respectively. These slightly 
lower detection rates are due to a small proportion of hu- 
mans with low entropy scores that overlap with some pe- 
riodic bots. These humans post mainly short messages, 
resulting in message size distributions with low entropy. 

Random Bots: The random bots use random timers 
with different distributions. Some random bots use dis- 
crete timings, e.g., 40, 64, or 88 seconds, while the others 
use continuous timings, e.g., uniformly distributed de- 
lays between 45 and 125 seconds. 

The inter-message delay EN and CCE tests detect
100% of all random bots, with one exception: the
inter-message delay CCE test against the August random bots
only achieves 72% detection rate, which is caused by the 
following two conditions: (1) the range of message de- 
lays of random bots is close to that of humans; (2) some- 
times the randomly-generated delay sequences have sim- 
ilar entropy rate to human patterns. The message size 
EN and CCE tests detect 31% and 6% of August ran- 
dom bots, respectively, and 7% and 8% of November 
random bots, respectively. These low detection rates are 
again due to a small proportion of humans with low message
size entropy scores. However, unlike periodic bots,
the message size distribution of random bots is highly 
dispersed, and thus, a larger proportion of random bots 
have high entropy scores, which overlap with those of 
humans. 

Responder Bots: The responder bots are among the 
advanced bots, and they behave more like humans than 
random or periodic bots. They are triggered to post mes- 
sages by certain human phrases. As a result, their timings 
are quite similar to those of humans. 

The inter-message delay EN and CCE tests detect
very few responder bots, only 3% and 13%, respec- 
tively. This demonstrates that human-message-triggered 
responding is a simple yet very effective mechanism for 
imitating the timing of human interactions. However, the 
detection rate for the message size EN test is slightly 
better at 27%, and the detection rate for the message size 
CCE test reaches 100%. While the message size distri- 
bution has sufficiently high entropy to frequently evade 
the EN test, there is some dependence between subsequent
message sizes, and thus, the CCE test detects the low
entropy pattern over time. 

Replay Bots: The replay bots also belong to the ad- 
vanced and human-like bots. They use replay attacks to 
fool humans. More specifically, the bots replay phrases 
they observed in chat rooms. Although not sophisticated 
in terms of implementation, the replay bots are quite ef- 
fective in deceiving humans as well as frustrating our 
message-size-based detections: the message size EN 
and CCE tests both have detection rates of 0%. Despite 
their clever trick, the timing of replay bots is periodic 
and easily detected. The inter-message delay EN and
CCE tests are very successful at detecting replay bots, 
both with 100% detection accuracy. 


5.2.2 Supervised and Hybrid Machine Learning Classifiers


The detection results of the machine learning classifier
are listed in Table 3. Table 3 shows the results for the
fully-supervised machine learning (SupML) classifier
and entropy-trained machine learning (EntML) classifier,
both trained on the August training datasets, and the
fully-supervised machine learning (SupMLretrained)
classifier trained on August and November training
datasets.


Table 3: Machine Learning Classifier Accuracy

|      | AUG. BOTS                     | NOV. BOTS                  | HUMANS |
| test | periodic | random | responder | periodic | random | replay | human  |


Periodic Bots: For the August dataset, both SupML
and EntML classifiers detect 100% of all periodic bots.
For the November dataset, however, the SupML classifier
only detects 27% of all periodic bots. The lower
detection rate is due to the fact that 62% of the periodic
bot messages in November chat logs are generated by
new bots, making the SupML classifier ineffective without
re-training. The SupMLretrained classifier detects
100% of November periodic bots. The EntML classifier
also achieves 100% for the November dataset.


Random Bots: For the August dataset, both SupML
and EntML classifiers detect 100% of all random bots.
For the November dataset, the SupML classifier detects
95% of all random bots, and the SupMLretrained classifier
detects 100% of all random bots. Although 52%
of the random bots have been upgraded according to
our observation, the old training set is still mostly effective
because certain content features of August random
bots still appear in November. The EntML classifier
again achieves 100% detection accuracy for the November
dataset.


Responder Bots: We only present the detection re- 
sults of responder bots for the August dataset, as the 
number of responder bots in the November dataset is 
very small. Although responder bots effectively mimic 
human timing, their message contents are only slightly 
obfuscated and are easily detected. The SupML and 
EntML classifiers both detect 100% of all responder 
bots. 


Replay Bots: The replay bots only exist in the 
November dataset. The SupML classifier detects only 
3% of all replay bots, as these bots are newly introduced 
in November. The SupMLretrained classifier detects
100% of all replay bots. The machine learning classifier 
reliably detects replay bots in the presence of a substan- 
tial number of replayed human phrases, indicating the 
effectiveness of machine learning techniques in chat bot 
classification. 

6 Conclusion and Future Work


This paper first presents a large-scale measurement study 
on Internet chat. We collected two-month chat logs for 
21 different chat rooms from one of the top Internet chat 
service providers. From the chat logs, we identified a to- 
tal of 14 different types of chat bots and grouped them 
into four categories: periodic bots, random bots, respon- 
der bots, and replay bots. Through statistical analysis on 
inter-message delay and message size for both chat bots 
and humans, we found that chat bots behave very differ- 
ently from human users. More specifically, chat bots ex- 
hibit certain regularities in either inter-message delay or 
message size. Although responder bots and replay bots 
employ advanced techniques to behave more human-like 
in some aspects, they still lack the overall sophistication 
of humans. 


Based on the measurement study, we further proposed 
a chat bot classification system, which utilizes entropy- 
based and machine-learning-based classifiers to accu- 
rately detect chat bots. The entropy-based classifier ex- 
ploits the low entropy characteristic of chat bots in either 
inter-message delay or message size, while the machine- 
learning-based classifier leverages the message content 
difference between humans and chat bots. The entropy- 
based classifier is able to detect unknown bots, includ- 
ing human-like bots such as responder and replay bots. 
However, it takes a relatively long time for detection, i.e., 
a large number of messages are required. Compared to 
the entropy-based classifier, the machine-learning-based 
classifier is much faster, i.e., a small number of messages 
are required. In addition to bot detection, a major task of 
the entropy-based classifier is to build and maintain the 
bot corpus. With the help of bot corpus, the machine- 
learning-based classifier is trained, and consequently, is 
able to detect chat bots quickly and accurately. Our ex- 
perimental results demonstrate that the hybrid classifica- 
tion system is fast in detecting known bots and is accu- 
rate in identifying previously-unknown bots. 


There are a number of possible directions for our fu- 
ture work. We plan to explore the application of entropy- 
based techniques in detecting other forms of bots, such 
as web bots. We also plan to investigate the development 
of more advanced chat bots that could evade our hybrid 
classification system. We believe that the continued work
in this area will reveal other important characteristics of 
bots and automated programs, which is useful in mal- 
ware detection and prevention. 


Acknowledgments 


We thank the anonymous reviewers for their insightful 
comments. This work was partially supported by NSF 
grants CNS-0627339 and CNS-0627340. Any opinions, 
findings, and conclusions or recommendations expressed 
in this material are those of the authors and do not neces- 
sarily reflect the views of the National Science Founda- 
tion. 


References 
[1] AHN, L. V., BLUM, M., HOPPER, N., AND LANGFORD, J. CAPTCHA: Using hard AI problems for security. In Proceedings of Eurocrypt (Warsaw, Poland, May 2003).

[2] BACHER, P., HOLZ, T., KOTTER, M., AND WICHERSKI, G. Know your enemy: Tracking botnets, 2005. http://www.honeynet.org/papers/bots [Accessed: Jan. 25, 2008].

[3] BACON, S. Chat rooms follow-up. http://www.ymessengerblog.com/blog/2007/08/21/chat-rooms-follow-up/ [Accessed: Jan. 25, 2008].

[4] BACON, S. Chat rooms update. http://www.ymessengerblog.com/blog/2007/08/24/chat-rooms-update-2/ [Accessed: Jan. 25, 2008].

[5] BACON, S. New entry process for chat rooms. http://www.ymessengerblog.com/blog/2007/08/29/new-entry-process-for-chat-rooms/ [Accessed: Jan. 25, 2008].

[6] BLOSSER, J., AND JOSEPHSEN, D. Scalable centralized bayesian spam mitigation with bogofilter. In Proceedings of the 2004 USENIX Systems Administration Conference (LISA’04) (Atlanta, GA., USA, November 2004).

[7] CRISLIP, D. Will Blizzard’s spam-stopper really work? http://www.wowinsider.com/2007/05/16/will-blizzards-spam-stopper-really-work/ [Accessed: Dec. 25, 2007].

[8] DAGON, D., GU, G., LEE, C. P., AND LEE, W. A taxonomy of botnet structures. In Proceedings of the 2007 Annual Computer Security Applications Conference (ACSAC’07) (Miami, FL., USA, December 2007).

[9] DEWES, C., WICHMANN, A., AND FELDMANN, A. An analysis of Internet chat systems. In Proceedings of the 2003 ACM/SIGCOMM Internet Measurement Conference (IMC’03) (Miami, FL., USA, October 2003).


[10] GIANVECCHIO, S., AND WANG, H. Detecting covert timing channels: An entropy-based approach. In Proceedings of the 2007 ACM Conference on Computer and Communications Security (CCS’07) (Alexandria, VA., USA, October 2007).

[11] GOEBEL, J., AND HOLZ, T. Rishi: Identify bot contaminated hosts by IRC nickname evaluation. In Proceedings of the USENIX Workshop on Hot Topics in Understanding Botnets (HotBots’07) (Cambridge, MA., USA, April 2007).

[12] GRAHAM, P. A plan for spam, 2002. http://www.paulgraham.com/spam.html [Accessed: Jan. 25, 2008].

[13] GU, G., PORRAS, P., YEGNESWARAN, V., FONG, M., AND LEE, W. Bothunter: Detecting malware infection through IDS-driven dialog correlation. In Proceedings of the 2007 USENIX Security Symposium (Security’07) (Boston, MA., USA, August 2007).

[14] GU, G., ZHANG, J., AND LEE, W. BotSniffer: Detecting botnet command and control channels in network traffic. In Proceedings of the 2008 Annual Network and Distributed System Security Symposium (NDSS’08) (San Diego, CA., USA, February 2008).

[15] HU, J. AOL: spam and chat don’t mix. http://www.news.com/AOL-Spam-and-chat-dont-mix/2100-1032_3-1024010.html [Accessed: Jan. 7, 2008].

[16] HU, J. Shutting of MSN chat rooms may open up IM. http://www.news.com/Shutting-of-MSN-chat-rooms-may-open-up-IM/2100-1025_3-5082677.html [Accessed: Jan. 7, 2008].

[17] JENNINGS III, R. B., NAHUM, E. M., OLSHEFSKI, D. P., SAHA, D., SHAE, Z.-Y., AND WATERS, C. A study of internet instant messaging and chat protocols. IEEE Network Vol. 20, No. 4 (2006), 16-21.

[18] KARLBERGER, C., BAYLER, G., KRUEGEL, C., AND KIRDA, E. Exploiting redundancy in natural language to penetrate bayesian spam filters. In Proceedings of the USENIX Workshop on Offensive Technologies (Boston, MA., USA, August 2007).

[19] KREBS, B. Yahoo! messenger network overrun by bots. http://blog.washingtonpost.com/securityfix/2007/08/yahoo_messenger_network_overru.html [Accessed: Dec. 18, 2007].

[20] LI, K., AND ZHONG, Z. Fast statistical spam filter by approximate classifications. In Proceedings of 2006 ACM/SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (St. Malo, France, June 2006).

[21] LIU, Z., LIN, W., LI, N., AND LEE, D. Detecting and filtering instant messaging spam - a global and personalized approach. In Proceedings of the IEEE Workshop on Secure Network Protocols (NPSEC’05) (Boston, MA., USA, November 2005).


[22] LOWD, D., AND MEEK, C. Good word attacks on statistical spam filters. In Proceedings of the 2005 Conference on Email and Anti-Spam (CEAS’05) (Mountain View, CA., USA, July 2005).

[23] MANNAN, M., AND VAN OORSCHOT, P. C. On instant messaging worms, analysis and countermeasures. In Proceedings of the ACM Workshop on Rapid Malcode (Fairfax, VA., USA, November 2005).

[24] MILLS, E. Yahoo! closes chat rooms over child sex concerns. http://www.news.com/Yahoo-closes-chat-rooms-over-child-sex-concerns/2100-1025_3-5759705.html [Accessed: Jan. 27, 2008].

[25] MOHTA, A. Bots are back in Yahoo! chat rooms. http://www.technospot.net/blogs/bots-are-back-in-yahoo-chat-room/ [Accessed: Dec. 18, 2007].

[26] MOHTA, A. Yahoo! chat adds CAPTCHA check to remove bots. http://www.technospot.net/blogs/yahoo-chat-captcha-check-to-remove-bots/ [Accessed: Dec. 18, 2007].

[27] NINO, T. Linden Lab taking action against landbots. http://www.secondlifeinsider.com/2007/05/18/linden-lab-taking-action-against-landbots/ [Accessed: Jan. 7, 2008].

[28] PETITION ONLINE. Action against the Yahoo! bot problem petition. http://www.petitiononline.com/ [Accessed: Dec. 18, 2007].

[29] PETITION ONLINE. AOL no more chat room spam petition. http://www.petitiononline.com/ [Accessed: Dec. 18, 2007].

[30] PORTA, A., BASELLI, G., LIBERATI, D., MONTANO, N., COGLIATI, C., GNECCHI-RUSCONE, T., MALLIANI, A., AND CERUTTI, S. Measuring regularity by means of a corrected conditional entropy in sympathetic outflow. Biological Cybernetics Vol. 78, No. 1 (January 1998).

[31] ROSIPAL, R. Kernel-Based Regression and Objective Nonlinear Measures to Assess Brain Functioning. PhD thesis, University of Paisley, Paisley, Scotland, UK, September 2001.

[32] SCHRAMM, M. Chat spam measures shut down multi-line reporting add-ons. http://www.wowinsider.com/2007/10/25/chat-spam-measures-shut-down-multi-lin reporting-addons/ [Accessed: Jan. 17, 2008].

[33] SEBASTIANI, F. Machine learning in automated text categorization. ACM Computing Surveys Vol. 34, No. 1 (2002), 1-47.

[34] SIMPSON, C. Yahoo! chat anti-spam resource center. http://www.chatspam.org/ [Accessed: Sep. 25, 2007].

[35] SYMANTEC SECURITY RESPONSE. W32.Imaut.AS worm. http://www.symantec.com/security_response/writeup.jsp?docid=2007-080114-2713-99 [Accessed: Jan. 25, 2008].


[36] THE ALICE ARTIFICIAL INTELLIGENCE FOUNDATION. ALICE (Artificial Linguistic Internet Computer Entity). http://www.alicebot.org/ [Accessed: Jan. 25, 2008].

[37] TURING, A. M. Computing machinery and intelligence. Mind Vol. 59 (1950), 433-460.

[38] UBER-GEEK.COM. Yahoo! responder bot. http://www.uber-geek.com/bot.html [Accessed: Jan. 18, 2008].

[39] WANG, P., SPARKS, S., AND ZOU, C. C. An advanced hybrid peer-to-peer botnet. In Proceedings of the USENIX Workshop on Hot Topics in Understanding Botnets (HotBots’07) (Cambridge, MA., USA, April 2007).

[40] WITTEL, G. L., AND WU, S. F. On attacking statistical spam filters. In Proceedings of the 2004 Conference on Email and Anti-Spam (CEAS’04) (Mountain View, CA., USA, July 2004).

[41] XIE, M., WU, Z., AND WANG, H. HoneyIM: Fast detection and suppression of instant messaging malware in enterprise-like networks. In Proceedings of the 2007 Annual Computer Security Applications Conference (ACSAC’07) (Miami Beach, FL, USA, December 2007).

[42] YAHELITE.ORG. Yahelite chat client. http://www.yahelite.org/ [Accessed: Jan. 8, 2008].

[43] YAZAKPRO.COM. Yazak pro chat client. http://www.yazakpro.com/ [Accessed: Jan. 8, 2008].

[44] YERAZUNIS, B. CRM114 - the controllable regex mutilator, 2003. http://crm114.sourceforge.net [Accessed: Jan. 25, 2008].

[45] ZDZIARSKI, J. A. Ending Spam: Bayesian Content Filtering and the Art of Statistical Language Classification. No Starch Press, 2005.




A Chat Bot Examples 


Note that in a chat room the following example messages would be spread out over several minutes. 


Example 1: Response Template 


bot: user1, that’s a damn good question.

bot: user1, To know more about Seventh-day Adventist; visit http://www.sda.org
Sabbath; http://www.sabbathtruth.com EGW; http://www.whiteestate.org

bot: user2, no! don’t leave me.

bot: user1, too much coffee tonight?

bot: user2, boy, you’re just full of questions, aren’t you?

bot: user2, lots of evidence for evolution can be found here
http://www.talkorigins.org/faqs/comdesc/


In the above example, the bot uses a template with three parts to post links: 
[username], [link description phrase] [link]. 


Example 2: Synonym Template 





bot: Allo Hunks! Enjoy Marjorie! Check My Free Pics 

bot: What’s happening Guys! Marjorie Here! See more of me at My Free Pics 
bot: Hi Babes! I am Marjorie! Rate My Live Cam 

bot: Horny lover Guys! Marjorie at your service! Inspect My Site 

bot: Mmmm Folks! Im Marjorie! View My Webpage 


In the above example, the bot uses a template with three parts to post messages: 
[salutation phrase]! [introduction phrase]! [web site advertisement phrase]. 


Example 3: Character Padding 


bot: anyone boredjn wanna chat?uklcss 

bot: any guystfrom the US/Canada hereiqjss 

bot: hiyafxqss 

bot: nel hereqbored?figqss 

bot: ne guysmwanna chat? ciuneed somel to make megsmile :-—)pktpss 


In the above example, the bot adds random characters to messages. 


Abstract


We present an automated, scalable method for crafting
dynamic responses to real-time network requests.
Specifically, we provide a flexible technique based on 
natural language processing and string alignment tech- 
niques for intelligently interacting with protocols trained 
directly from raw network traffic. We demonstrate the 
utility of our approach by creating a low-interaction web- 
based honeypot capable of luring attacks from search 
worms targeting hundreds of different web applications. 
In just over two months, we witnessed over 368,000
attacks from more than 5,600 botnets targeting several 
hundred distinct webapps. The observed attacks included 
several exploits detected the same day the vulnerabilities 
were publicly disclosed. Our analysis of the payloads of 
these attacks reveals the state of the art in search-worm 
based botnets, packed with surprisingly modular and di- 
verse functionality. 


1 Introduction 


Automated network attacks by malware pose a signif- 
icant threat to the security of the Internet. Nowadays, 
web servers are quickly becoming a popular target for 
exploitation, primarily because once compromised, they 
open new avenues for infecting vulnerable clients that 
subsequently visit these sites. Moreover, because web 
servers are generally hosted on machines with signifi- 
cant system resources and network connectivity, they can 
serve as reliable platforms for hosting malware (particu- 
larly in the case of server farms), and as such, are entic- 
ing targets for attackers [25]. Indeed, lately we have wit- 
nessed a marked increase in so-called “search worms” 
that seek out potential victims by crawling the results 
returned by malevolent search-engine queries [24, 28]. 
While this new change in the playing field has been noted 
for some time now, little is known about the scope of this 
growing problem. 

To better understand this new threat, researchers and
practitioners alike have recently started to move towards 
the development of low-interaction, web-based honey- 
pots [3]. These differ from traditional honeypots in that 
their only purpose is to monitor automated attacks di- 
rected at vulnerable web applications. However, web- 
based honeypots face a unique challenge—they are in- 
effective if not broadly indexed under the same queries 
used by malware to identify vulnerable hosts. At the 
same time, the large number of different web applica- 
tions being attacked poses a daunting challenge, and the 
sheer volume of attacks calls for efficient solutions. Un- 
fortunately, current web-based honeypot projects tend 
to be limited in their ability to easily simulate diverse 
classes of vulnerabilities, require non-trivial amounts of 
manual support, or do not scale well enough to meet this 
challenge. 

A fundamental difference between the type of mal- 
ware captured by traditional honeypots (e.g., Hon- 
eyd [23]) and approaches geared towards eliciting pay- 
loads from search-based malware stems from how poten- 
tial victims are targeted. For traditional honeypots, these 
systems can be deployed at a network telescope [22], for 
example, and can simply take advantage of the fact that 
for random scanning malware, any traffic that reaches 
the telescope is unsolicited and likely malicious in na- 
ture. However, search-worms use a technique more akin 
to instantaneous hit-list automation, thereby only target- 
ing authentic and vulnerable hosts. Were web-based hon- 
eypots to mimic the passive approach used for traditional 
honeypots, they would likely be very ineffective. 

To address these limitations, we present a method for 
crafting dynamic responses to on-line network requests 
using sample transcripts from observed network inter- 
action. In particular, we provide a flexible technique 
based on natural language processing and string align- 
ment techniques for intelligently interacting with proto- 
cols trained directly from raw traffic. Though our ap- 
proach is application-agnostic, we demonstrate its utility
with a system designed to monitor and capture au-
tomated network attacks against vulnerable web appli- 
cations, without relying on static vulnerability signa- 
tures. Specifically, our approach (disguised as a typical 
web server) elicits interaction with search engines and, 
in turn, search worms in the hope of capturing their il- 
licit payload. As we show later, our dynamic content 
generation technique is fairly robust and easy to deploy. 
Over a 72-day period we were attacked repeatedly, and 
witnessed more than 368,000 attacks originating from 
28,856 distinct IP addresses. 

The attacks target a wide range of web applications, 
many of which attempt to exploit the vulnerable appli- 
cation(s) via a diverse set of injection techniques. To 
our surprise, even during this short deployment phase, 
we witnessed several attacks immediately after public 
disclosure of the vulnerabilities being exploited. That, 
by itself, validates our technique and underscores both 
the tenacity of attackers and the overall pervasiveness 
of web-based exploitation. Moreover, the relentless na- 
ture of these attacks certainly sheds light on the scope of 
this problem, and calls for immediate solutions to better 
curtail this increasing threat to the security of the Inter- 
net. Lastly, our forensic analysis of the captured pay- 
loads confirms several earlier findings in the literature, as 
well as highlights some interesting insights on the post- 
infection process and the malware themselves. 

The rest of the paper is organized as follows. Sec- 
tion 2 discusses related work. We provide a high-level 
overview of our approach in Section 3, followed by 
specifics of our generation technique in Section 4. We 
provide a validation of our approach based on interaction 
with a rigid binary protocol in Section 5. Additionally, 
we present our real-world deployment and discuss our 
findings in Section 6. Finally, we conclude in Section 7. 


2 Related Work 


Generally speaking, honeypots are deployed with the in- 
tention of eliciting interaction from unsuspecting adver- 
saries. The utility in capturing this interaction has been 
diverse, allowing researchers to discover new patterns 
and trends in malware propagation [28], generate new 
signatures for intrusion-detection systems and Internet 
security software [16, 20, 31], collect malware binaries 
for static and/or dynamic analysis [21], and quantify ma- 
licious behavior through widespread measurement stud- 
ies [26], to name a few. 

The adoption of virtual honeypots by the security com- 
munity only gained significant traction after the introduc- 
tion of low-interaction honeypots such as Honeyd [23]. 
Honeyd is a popular tool for establishing multiple virtual 
hosts on a single machine. Though Honeyd has proved 
to be fairly useful in practice, it is important to recognize 
that its effectiveness is strictly tied to the availability of
accurate and representative protocol-emulation scripts, 
whose generation can be fairly tedious and time con- 
suming. High-interaction honeypots use a different ap- 
proach, replying with authentic and unscripted responses 
by hosting sand-boxed virtual machines running common
software and operating systems [11].

A number of solutions have been proposed to bridge 
the separation of benefits and restrictions that exist be- 
tween high and low-interaction honeypots. For example, 
Leita et al. proposed ScriptGen [18, 17], a tool that auto- 
matically generates Honeyd scripts from network traffic 
logs. ScriptGen creates a finite state machine for each 
listening port. Unfortunately, as the amount and diver- 
sity of available training data grows, so does the size and 
complexity of its state machines. Similarly, RolePlayer 
(and its successor, GQ [10]) generates scripts capable of 
interacting with live traffic (in particular, worms) by ana- 
lyzing series of similar application sessions to determine 
static and dynamic fields and then replay appropriate re- 
sponses. This is achieved by using a number of heuris- 
tics to remove common contextual values from annotated 
traffic samples and using byte-sequence alignment to find 
potential session identifiers and length fields. 
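The idea of aligning similar sessions to separate static from dynamic fields can be illustrated with a short sketch. It uses Python's standard `difflib.SequenceMatcher` as a stand-in aligner; this is not RolePlayer's actual alignment algorithm or heuristics, and the function name `dynamic_fields` and the sample requests are assumptions.

```python
from difflib import SequenceMatcher


def dynamic_fields(a: bytes, b: bytes):
    """Return (a_start, a_end, b_start, b_end) spans where two messages differ.

    Regions that align as equal are treated as static protocol text;
    everything else is a candidate dynamic field.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (i1, i2, j1, j2)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    first = b"GET /id=1234 HTTP"
    second = b"GET /id=9876 HTTP"
    for a_start, a_end, b_start, b_end in dynamic_fields(first, second):
        print(first[a_start:a_end], second[b_start:b_end])
```

With more than two sample sessions, the same comparison would be repeated pairwise and the differing spans intersected; that extension is omitted here.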

While neither of these systems specifically targets
search-based malware, they represent germane ap- 
proaches and many of the secondary techniques they in- 
troduce apply to our design as well. Also, their respective 
designs illustrate an important observation—the choice 
between using a small or large set of sample data man- 
ifests itself as a system tradeoff: there is little diversity 
to the requests recognized and responses transmitted by 
RolePlayer, thereby limiting its ability to interact with 
participants whose behavior deviates from the training 
session(s). On the other hand, the flexibility provided by 
greater state coverage in ScriptGen comes at a cost to 
scalability and complexity. 

Lastly, since web-based honeypots rely on search en- 
gines to index their attack signatures, they are at a disad- 
vantage each time a new attack emerges. In our work, we 
sidestep the indexing limitations common to static signa- 
ture web-based honeypots and achieve broad query rep- 
resentation prior to new attacks by proactively generating 
“signatures” using statistical language models trained on 
common web-application scripts. When indexed, these 
signatures allow us to monitor attack behavior conducted 
by search worms without explicitly deploying structured 
signatures a priori. 


3 High-level Overview 


We now briefly describe our system architecture. Its 
setup follows the description depicted in Figure 1, which 
is conceptually broken into three stages: pre-processing, 
[Figure 1 panels. Pre-Processing: collect and encode
network samples; split samples into pairs of requests and
corresponding responses; train TF-IDF with corpus of
requests. Classification: cluster requests with iterative
k-medoids using TF-IDF distance; merge similar clusters
(rejoin medoids and discard clusters). Language-Model
Generation: find appropriate cluster for each request;
train language model with corresponding responses; build
language model for each medoid.]

Figure 1: Setup consists of three distinct stages conducted in tandem in preparation for deployment. 


classification, and language-model generation. We ad- 
dress each part in turn. We note that although our 
methodology is not protocol specific, for pedagogical 
reasons, we provide examples specific to the DNS pro- 
tocol where appropriate. Our decision to use DNS for 
validation stems from the fact that validating the correct- 
ness of an HTTP response is ill-defined. Likewise, many 
ASCII-based protocols that come to mind (e.g., HTTP, 
SMTP, IRC) lack strict notions of correctness and so 
do not serve as a good conduit to demonstrate the cor- 
rectness of the output we generate. 


To begin, we pre-process and sanitize all trace data 
used for training. Network traces are stripped of trans- 
port protocol headers and organized by session into pairs 
of requests and responses. Any trace entries that cor- 
respond to protocol errors (e.g., HTTP 404) are omit- 
ted. Next, we group request and response pairs using 
a variant of iterative k-means clustering with TF/IDF 
(i.e., term frequency-inverse document frequency) co- 
sine similarity as our distance metric. Formally, we 
apply a k-medoids algorithm for clustering, which as- 
signs samples from the data as cluster medoids (i.e., cen- 
troids) rather than numerical averages. For reasons that 
should become clear later, pair similarity is based solely 
on the content of the request samples. Upon completion, 
we then generate and train a collection of smoothed n- 
gram language-models for each cluster. These language- 
models are subsequently used to produce dynamic re- 
sponses to online requests. However, because message 
formats may contain session-specific fields, we also post- 
process responses to satisfy these dependencies when- 
ever they can be automatically inferred. For example, in 
DNS, a session identifier uniquely identifies each record 
request with its response. 


During a live deployment, online classification is used
to select the response model best suited to the incoming
request (i.e., by mapping the request to its nearest
medoid). For instance, a DNS request for an MX record
will ideally match a medoid that maps to other MX
requests. The medoid with the minimum TF/IDF distance
to an online request identifies which language model is 
used for generating responses. The language models are 
built in such a way that they produce responses influ- 
enced by the training data. The overall process is de- 
picted in Figure 2. For our evaluation as a web-based 
honeypot (in Section 6), this process is used in two dis- 
tinct stages: first, when interacting with search engines 
for site indexing and second, when courting malware. 


4 Under the Hood 


In what follows, we present more specifics about
our design and implementation. Recall that our goal is
to provide a technique for automatically generating valid
responses to protocol interactions learned directly from
raw traffic.

In lieu of semantic knowledge, we instead apply clas- 
sic pattern classification techniques for partitioning a set 
of observed requests. In particular, we use the iterative 
k-medoids algorithm. As our distance metric we choose 
to forgo byte-sequence alignment approaches that have 
been previously used to classify similarities between
protocol messages (e.g., [18, 9, 6]). As Cui et al. observed,
while these approaches are appropriate for classifying 
requests that only differ parametrically, byte-sequence 
alignment is ill-suited for classifying messages with dif- 
ferent byte-sequences [8]. Therefore, we use TF/IDF 
cosine similarity as our distance metric. 

Intuitively, term frequency-inverse document fre- 
quency (TF / IDF) is the measure of a term’s significance 
to a string or document given its significance among a set 
of documents (or corpus). TF / IDF is often used in infor- 
mation retrieval for a number of applications including 
automatic text retrieval and approximate string
matching [29]. Mathematically, we compute TF/IDF in the
following way: let $n_{i,j}$ denote how often the term $i$
appears in document $d_j$ such that $d_j \in D$, a
collection of documents. Then TF/IDF = TF · IDF, where TF
grows with the frequency $n_{i,j}$ of the term within
$d_j$ and IDF shrinks as the term appears in a larger
fraction of the documents in $D$.

The term-similarity between two strings from the
same corpus can be computed by calculating their 
TF/IDF distance. To do so, both strings are first rep- 
resented as multi-dimensional vectors. For each term in 
a string (e.g., word), its TF/IDF value is computed as 
described previously. Then, for a string with n terms, an 
n-dimensional vector is formed using these values. The 
cosine of the angle between two such vectors represent- 
ing strings indicates a measure of their similarity (hence, 
its complement is a measure of distance). 
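
The distance computation described above can be sketched as follows. This is a minimal illustration, not the implementation used in our system: it assumes the common formulation in which TF is a term's count divided by the document length and IDF is the logarithm of the corpus size over the number of documents containing the term. The function names and the three-request corpus are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns one sparse {term: weight} dict per document.

    Assumption: TF = count / document length, IDF = log(|D| / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs at least once.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(u, v):
    """Complement of the cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # no informative terms to compare
    return 1.0 - dot / (norm_u * norm_v)


# Hypothetical corpus of already-tokenized requests.
corpus = [
    ["GET", " ", "/index.php", " ", "HTTP/1.1"],
    ["GET", " ", "/gallery.php", " ", "HTTP/1.1"],
    ["POST", " ", "/login.cgi", " ", "HTTP/1.0"],
]
vecs = tfidf_vectors(corpus)
```

Terms that occur in every request (here, the space token) receive an IDF of zero and therefore do not influence the distance, which is the behavior one wants from a ubiquitous delimiter.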

In the context of our implementation, terms are
delineated by tokenizing requests into the following classes:
one or more spaces, one or more printable characters
(excluding spaces), and one or more non-printable
characters (also excluding spaces).² We chose the space
character as a primary term delimiter due to its common
occurrence in text-based protocols; however, the delimiter
could easily have been chosen automatically by
identifying the most frequent byte in all requests. The
collection of all requests (and their constituent terms)
forms the TF/IDF corpus.

Once TF/IDF training is complete, we use an iterative
k-medoids algorithm, shown in Algorithm 1, to identify
similar requests. Upon completion, the classification
algorithm produces a partitioning of the set of all
requests into k (or fewer) clusters. In an effort to
rapidly classify online requests in a memory-efficient
manner, we retain only the medoids and dissolve all
clusters. For our deployment, we empirically choose
k = 30, and then perform a trivial cluster-collapsing
algorithm: we iterate through the k clusters and, for each
cluster, calculate the mean and standard deviation of the
distance between the medoid and the other members of the
cluster. Once the k means and standard deviations are
known, we collapse pairs of clusters if the medoid
requests are no more than one standard deviation apart.
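
A compact sketch of the clustering and collapsing steps is given below. It is illustrative only and rests on two assumptions that the text leaves open: the collapse threshold for a pair of clusters is taken to be the larger of their two standard deviations, and a collapsed cluster keeps the medoid of the cluster encountered first. The names `assign`, `k_medoids`, and `collapse` are hypothetical, and `dist` stands for any distance function (such as TF/IDF cosine distance).

```python
import random
import statistics


def assign(items, medoids, dist):
    """Map each medoid index to the indices of the items nearest to it."""
    clusters = {m: [] for m in medoids}
    for i, item in enumerate(items):
        nearest = min(medoids, key=lambda m: dist(item, items[m]))
        clusters[nearest].append(i)
    return {m: members for m, members in clusters.items() if members}


def k_medoids(items, k, dist, max_iter=100, seed=0):
    """Partition items around at most k medoids drawn from the data itself."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(items)), k)
    for _ in range(max_iter):
        clusters = assign(items, medoids, dist)
        # New medoid: the member with the lowest total distance to its cluster.
        new_medoids = [
            min(members, key=lambda c: sum(dist(items[c], items[j]) for j in members))
            for members in clusters.values()
        ]
        if sorted(new_medoids) == sorted(medoids):
            break
        medoids = new_medoids
    return assign(items, medoids, dist)


def collapse(items, clusters, dist):
    """Merge clusters whose medoids lie within one standard deviation.

    Assumption: the threshold is the larger of the two clusters' standard
    deviations of member-to-medoid distance.
    """
    spread = {}
    for m, members in clusters.items():
        d = [dist(items[m], items[j]) for j in members if j != m]
        spread[m] = statistics.pstdev(d) if d else 0.0
    merged = {}
    for m, members in clusters.items():
        for kept in merged:
            if dist(items[m], items[kept]) <= max(spread[m], spread[kept]):
                merged[kept].extend(members)
                break
        else:
            merged[m] = list(members)
    return merged


# Toy usage with numbers standing in for requests.
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
gap = lambda a, b: abs(a - b)
clusters = k_medoids(points, k=2, dist=gap)
```

Only the dictionary keys (the medoids) need to be retained for online classification; the member lists are used here solely to compute the spread of each cluster.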


4.1 Dynamic Response Generation 


Since one of the goals of our method is to generate not
only valid but also dynamic responses to requests, we
employ natural language processing (NLP) techniques to
create models of protocols. These models, termed
language models, assign probabilities of occurrence to
sequences of tokens based on a corpus of training data.
With natural languages such as English we might define
a token or, more accurately, a 1-gram as a string of
characters (i.e., a word) delimited by spaces or other
punctuation. However, given that we are not working with
natural languages, we define a new set of delimiters for
protocols. The 1-gram token in our model adheres to one of
the following criteria: (1) one or more spaces, (2) one or
more printable characters, (3) one or more non-printable
characters, or (4) the beginning of message (BOM) or end
of message (EOM) tokens.
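
The four token classes can be expressed as a single regular expression over raw bytes. The sketch below is an assumption-laden illustration: it treats bytes 0x21 through 0x7E as printable, and it represents the BOM and EOM markers as Python strings (hypothetical spellings) so that they can never collide with a byte token taken from a message.

```python
import re

# Hypothetical marker spellings; strings cannot equal any bytes token.
BOM, EOM = "<BOM>", "<EOM>"

TOKEN = re.compile(
    rb" +"               # (1) one or more spaces
    rb"|[\x21-\x7e]+"    # (2) one or more printable characters (no spaces)
    rb"|[^\x20-\x7e]+"   # (3) one or more non-printable characters
)


def tokenize(message: bytes):
    """Split a raw message into 1-grams, bracketed by BOM and EOM."""
    return [BOM] + TOKEN.findall(message) + [EOM]


http_tokens = tokenize(b"GET  /a\r\n")
dns_tokens = tokenize(b"\x12\x34\x01\x00")
```

Because the three byte classes are disjoint and together cover every byte value, concatenating the byte tokens always reproduces the original message.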

MedoidSet ← SelectKRandomElements(ObservedRequests)
RequestMap<RequestType, MedoidType> ← ∅   // for mapping requests to medoids
repeat
    for all R ∈ (ObservedRequests − MedoidSet) do
        for all M ∈ MedoidSet do
            Distance ← TF/IDF(R, M)
            if RequestMap[R] = ⊥ or Distance < TF/IDF(R, RequestMap[R]) then
                RequestMap[R] ← M
    for all M ∈ MedoidSet do
        M ← FindMemberWithLowestMeanDistance(M, RequestMap)
until HasConverged(MedoidSet)
for all (Mi, Mj) ∈ MedoidSet s.t. i ≠ j do
    if TF/IDF(Mi, Mj) < FindThresholdDistance(Mi, Mj) then
        MedoidSet ← MedoidSet − {Mi, Mj}
        MedoidSet ← MedoidSet ∪ Merge(Mi, Mj, RequestMap)

Algorithm 1: Iterative k-Medoids Classification for Observed Requests


The training corpora we use contain both requests and
responses. Adhering to our assumption that similar
requests have similar responses, we train k response
language models on the responses associated with each of
the k request clusters. That is, each cluster’s response
language model is trained on the packets seen in response
to the requests in that cluster. Recall that to avoid having
to keep every request cluster in memory, we keep only
the medoids for each cluster. Then, for each (request,
response) tuple, we recalculate the distance to each of the
k request medoids. The medoid with the minimal
distance to the tuple’s request identifies which of the k
language models is trained using the response. After
training concludes, each of the k response language models
has a probability of occurrence associated with each
observed sequence of 1-grams. A sequence of two 1-grams
is called a 2-gram, a sequence of three 1-grams is called a
3-gram, and so on. We cap the maximum n-gram length,
n, at eight.

Since it is unlikely that we have witnessed every
possible n-gram during training, we use a technique called
smoothing to lend probability to unobserved sequences.
Specifically, we use parametric Witten-Bell back-off
smoothing [30], which is the state of the art for n-gram
models. If we consider 3-grams, this smoothing method
estimates the 3-gram probability by interpolating
between the naive count ratio $C(w_1 w_2 w_3)/C(w_1 w_2)$
and a recursively smoothed estimate of the 2-gram
probability $P(w_3 \mid w_2)$. The recursively smoothed
probabilities are less vulnerable to low counts because of
the shorter context. A 2-gram is more likely to occur in
the training data than a 3-gram, and the trend progresses
similarly as the n-gram length decreases. By smoothing,
we get a reasonable estimate of the probability of
occurrence for all possible n-grams, even those we have
never seen during training. Smoothing also mitigates the
possibility that certain n-grams dominate in small training
corpora. It is important to note that during generation, we
only consider the states seen in training.
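
To make the interpolation concrete, the sketch below implements the textbook interpolated form of Witten-Bell smoothing: for a context h followed by T(h) distinct tokens over C(h) occurrences, P(w | h) = (C(hw) + T(h) · P(w | h')) / (C(h) + T(h)), where h' is h with its oldest token dropped. This is an assumption about the variant; the parametric back-off form cited above may differ in detail. The class name, the uniform base distribution, and the toy sequences are all hypothetical.

```python
from collections import Counter, defaultdict


class WittenBell:
    """Interpolated Witten-Bell estimates for n-grams up to a fixed order."""

    def __init__(self, sequences, order=3):
        self.order = order
        self.counts = defaultdict(Counter)  # context tuple -> next-token counts
        self.vocab = set()
        for seq in sequences:
            self.vocab.update(seq)
            for i, token in enumerate(seq):
                for n in range(order):
                    if i - n < 0:
                        break
                    self.counts[tuple(seq[i - n:i])][token] += 1

    def prob(self, token, context=()):
        context = tuple(context)
        keep = self.order - 1
        if len(context) > keep:
            context = context[len(context) - keep:]
        return self._prob(token, context)

    def _prob(self, token, context):
        # Lower-order estimate: drop the oldest context token, or fall back
        # to a uniform distribution once no context remains (an assumption).
        if context:
            lower = self._prob(token, context[1:])
        else:
            lower = 1.0 / len(self.vocab)
        followers = self.counts.get(context)
        if not followers:
            return lower  # unseen context: rely entirely on the shorter one
        total = sum(followers.values())
        distinct = len(followers)
        return (followers[token] + distinct * lower) / (total + distinct)


# Toy training data; real input would be tokenized responses.
model = WittenBell([["BOM", "a", "b", "EOM"], ["BOM", "a", "c", "EOM"]], order=3)
p_seen = model.prob("b", ("BOM", "a"))
p_unseen = model.prob("EOM", ("BOM", "a"))
```

The unseen continuation receives a small but nonzero probability, and the estimates for any one context still sum to one over the vocabulary. Setting `order=8` would correspond to the maximum n-gram length used in our models.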

To perform the response generation, we use the
language models to define a Markov model. This Markov
model can be thought of as a large finite state machine
where each transition occurs based on a transition
probability rather than an input. As well, each “next state” is
conditioned solely on the previous state. The transition
probability is derived directly from the language models.
The transition probability from a 1-gram, $w_1$, to a
2-gram, $w_1 w_2$, is $P(w_2 \mid w_1)$, and so on.
Intuitively, generation is accomplished by conducting a
probabilistic simulation from the start state (i.e., BOM)
to the end state (i.e., EOM).


More specifically, to generate a response, we perform 
a random walk on the Markov model corresponding to 
the identified request cluster. From the BOM state, we 
randomly choose among the possible next states with the 
probabilities present in the language model. For instance, 
if the letters (B, C, D) can follow A with probabilities
(70%, 20%, 10%) respectively, then we will choose the 
AB path approximately 70% of the time and similarly 
for AC and AD. We use this random walk to create re- 
sponses similar to those seen in training not only in syn- 
tax but also in frequency. Ideally, we would produce the 
same types of responses with the same frequency as those 
seen during training, but the probabilities used are at the 
1-gram level and not the response packet level. 
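
The random walk can be sketched in a few lines. The version below is a simplified illustration that samples directly from raw follower counts (no smoothing), which is consistent with visiting only states seen in training. The function names, the marker spellings, and the toy training set that reproduces the 70/20/10 example are assumptions.

```python
import random
from collections import Counter, defaultdict

BOM, EOM = "<BOM>", "<EOM>"  # hypothetical marker spellings


def train(responses, order=2):
    """Count which token follows each context of up to order-1 tokens."""
    assert order >= 2
    width = order - 1
    table = defaultdict(Counter)
    for tokens in responses:
        seq = [BOM] + list(tokens) + [EOM]
        for i in range(1, len(seq)):
            table[tuple(seq[max(0, i - width):i])][seq[i]] += 1
    return table


def generate(table, order=2, rng=random, max_tokens=1000):
    """Walk from BOM to EOM, choosing each step in proportion to its count."""
    width = order - 1
    out = [BOM]
    while out[-1] != EOM and len(out) < max_tokens:
        followers = table[tuple(out[-width:])]
        tokens, weights = zip(*followers.items())
        out.append(rng.choices(tokens, weights=weights)[0])
    return out[1:-1] if out[-1] == EOM else out[1:]


# Training set in which B, C, and D follow A 70%, 20%, and 10% of the time.
training = [["A", "B"]] * 7 + [["A", "C"]] * 2 + [["A", "D"]] * 1
table = train(training)
rng = random.Random(1)
samples = [tuple(generate(table, rng=rng)) for _ in range(2000)]
```

Over many walks the share of `("A", "B")` responses approaches 70%, matching the per-token frequencies rather than any guarantee about whole-response frequencies.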


The Markov models used to generate responses at- 
tempt to generate valid responses based on the training 
data. However, because the training is over the entire 
set of responses corresponding to a cluster, we cannot 
recognize contextual dependencies between requests and 
responses. Protocols will often have session identifiers 
or tokens that necessarily need to be mirrored between 
request and response. DNS, for instance, has a two byte 
session identifier in the request that needs to appear in 
any valid response. As well, the DNS name or IP re- 
quested also needs to appear in the response. While the 
NLP engine will recognize that some session identifier 
and domain name should occupy the correct positions in 
the response, it is unlikely that the correct session iden- 
tifier and name will be chosen. For this reason, we au- 
tomatically post-process the NLP generated response to 
appropriately satisfy contextual dependencies. 
Figure 3: A sliding window template traverses request
tokens to identify variable-length tokens that should be 
reproduced in related responses. 


4.1.1 Detecting Contextual Dependencies 


Generally speaking, protocols have two classes of
contextual dependencies: invariable length tokens and
variable length tokens. Invariable length tokens are, as the
name implies, tokens that always contain the same
number of bytes. For the most part, protocols with
variable length tokens adhere to one of two standards:
tokens preceded by a length field and tokens separated
using a special byte delimiter. Overwhelmingly,
protocols use length-preceded tokens (DNS, Samba,
Netbios, NFS, etc.). The other, less common type
(as in HTTP) employs variable length delimited tokens.

Our method for handling each of these token types
differs only slightly from the techniques employed by other
active responders and protocol disassemblers ([8,
9]). Specifically, we identify contextual dependencies
using two techniques. First, we apply the Needleman- 
Wunsch string alignment algorithm [19] to align requests 
with their associated responses during training. Since the 
language models we use are not well suited for this par- 
ticular task, this process is used to identify if, and where, 
substrings from a request also appear in its response. If 
certain bytes or sequences of bytes match over an em- 
pirically derived threshold (80% in our case), these bytes 
are considered invariable length tokens and the byte po- 
sitions are copied from request to response after the NLP 
generation phase. 

Figure 4: Frequency of dominant type/class per request
cluster (with k = 30), sorted from least to most accurate.


To identify variable length tokens, we make the
simplifying assumption that these types of tokens are
preceded by a length identifier; we do so primarily because
we are unaware of any protocols that contain
contextual dependencies between request and response through
character-delimited variable length tokens. As depicted
in Figure 3, we iterate over each request and consider
each set of up to four bytes as a length identifier if and
only if the token that follows it belongs to a certain
character class³ for the described length. In our example,
Token_i is identified as a candidate length field based
upon its value. Since the next immediate token is of the
length described by Token_i (i.e., 8), Token_j is
identified as a variable length token. For each variable length
token discovered, we search for the same token in the
observed response. We copy these tokens after NLP
generation if and only if this matching behavior was
common to more than half of the request and response pairs
observed throughout training.
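
The sliding-window search can be illustrated with the simplified sketch below. It makes assumptions beyond the text: the length field has a fixed width (one byte by default, as in DNS labels, rather than every width up to four), the character class is printable ASCII, and very short runs are ignored to limit false positives. The function name and the sample DNS query are hypothetical.

```python
def length_prefixed_tokens(message: bytes, width: int = 1, min_len: int = 2):
    """Return (offset, token) for each run that a big-endian length field of
    `width` bytes describes exactly and that is entirely printable ASCII."""
    found = []
    i = 0
    while i + width <= len(message):
        length = int.from_bytes(message[i:i + width], "big")
        token = message[i + width:i + width + length]
        printable = all(0x21 <= b <= 0x7E for b in token)
        if length >= min_len and len(token) == length and printable:
            found.append((i, token))
            i += width + length  # skip past the token just consumed
        else:
            i += 1
    return found


# A DNS query for www.example.com (type A, class IN) with ID 0x1234.
header = b"\x12\x34\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00"
query = header + b"\x03www\x07example\x03com\x00" + b"\x00\x01\x00\x01"
labels = length_prefixed_tokens(query)
```

Each token found this way would then be searched for in the paired response, and mirrored after generation only if the match holds for more than half of the training pairs.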

As an aside, the content-length header field in our 
HTTP responses also needs to accurately reflect the num- 
ber of bytes contained in each response. If the value 
of this field is greater than the number of bytes in a 
response, the recipient will poll for more data, causing 
transactions to stall indefinitely. Similarly, if the value of 
the content-length field is less than the number of bytes 
in the response, the recipient will prematurely halt and 
truncate additional data. While other approaches have 
been suggested for automatically inferring fields of this 
type, we simply post-process the generated HTTP re- 
sponse and automatically set the content-length value to 
be the number of bytes after the end-of-header character. 
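
A minimal sketch of this post-processing step follows. It assumes that the header block ends at the first blank line (CRLF CRLF) and that a missing header should be appended; the function name is hypothetical.

```python
import re

# Matches an existing Content-Length header line, case-insensitively.
_CONTENT_LENGTH = re.compile(rb"^Content-Length:[^\r\n]*", re.IGNORECASE | re.MULTILINE)


def fix_content_length(response: bytes) -> bytes:
    """Set Content-Length to the number of bytes after the header block."""
    head, sep, body = response.partition(b"\r\n\r\n")
    if not sep:
        return response  # no header terminator found; leave untouched
    value = b"Content-Length: " + str(len(body)).encode("ascii")
    if _CONTENT_LENGTH.search(head):
        head = _CONTENT_LENGTH.sub(value, head, count=1)
    else:
        head += b"\r\n" + value
    return head + sep + body


generated = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    b"Content-Length: 999\r\n"
    b"\r\n"
    b"hello"
)
fixed = fix_content_length(generated)
```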


5 Validation 


In order to assess the correctness of our dynamic
response generation techniques, we validate our overall
approach in the context of DNS. We reiterate that we
chose DNS because it is a rigid binary protocol; if we
can correctly generate dynamic responses for this
protocol, we believe it aptly demonstrates the strength
(and soundness) of our approach. For our subsequent
evaluation, we train our DNS responder on a week’s worth
of raw network traces collected from a public wireless
network used by approximately 50 clients. The traffic was
automatically partitioned into request and response
tuples as outlined in Section 4.
Figure 5: Frequency of dominant type/class per response
cluster (with k = 30), sorted from least to most accurate. 


To validate the output of our clustering technique, 
we consider clustering of requests successful if for each 
cluster, one type of request (A, MX, NS, etc.) and the class 
of the request (IN) emerges as the most dominant mem- 
ber of the cluster; a cluster with one type and one class 
appearing more frequently than any other is likely to cor- 
rectly classify an incoming request and, in turn, generate 
a response to the correct query type and class. We report 
results based on using 10,000 randomly selected flows 
for training. As Figures 4 and 5 show, nearly all clusters 
have a dominating type and class. 

To demonstrate our response generation’s success rate,
we performed 20,000 DNS requests on randomly
generated domain names (of varying length). We used
the UNIX command host⁴ to request several types of
records. For validation purposes, we consider a response
as strictly faithful if it is correctly interpreted by the
requesting program with no warnings or errors. Likewise,
we consider a response as valid if it processes correctly
with or without warnings or errors. The results are shown
in Figure 6 for various training flow sizes. Notice that
we achieve a high success rate with as few as 5,000
flows, with correctness ranging between 89% and 92%
for strictly faithful responses, and over 98% accuracy in
the case of valid responses.

In summary, this demonstrates that the overall design
depicted in Figures 1 and 2 (which embodies our
training phase, classification phase, model generation, and
post-processing phase to detect contextual dependencies
and correctly mirror the representative tokens in their
correct location(s)) produces faithful responses. More
importantly, these responses are learned automatically and
require little or no manual intervention. In what follows,
we further substantiate the utility of our approach in the
context of a web-based honeypot.
6 Evaluation


Our earlier assertion was that the exploitation of
web-apps now poses a serious threat to the Internet. In order
to gauge the extent to which this is true, we used our
dynamic generation techniques to build a lightweight HTTP
responder, in the hope of snatching attack traffic
targeted at web applications. These attackers query popular
search engines for strings that fingerprint the vulnerable
software and isolate their targets.





Table 1: Query Types 


With this in mind, we obtained a list of 3,285 of
the most searched queries on Google by known botnets
attempting to exploit web applications.⁵ We then queried
Google for the top 20 results associated with each query.
Although there are several bot queries that are
ambiguous and are most likely not targeting a specific web
application, most of the queries were targeted. However,
automatically determining the number of different web
applications being attacked is infeasible, if not
impossible. For this reason, we provide only the breakdown in
the types of web applications being exploited (i.e., PHP,
Perl, CGI, etc.) in Table 1.

Nearly all of the bots searching for these queries
exploit command injection vulnerabilities. The PHP
vulnerabilities are most commonly exploited through
remote inclusion of a PHP script, while the Perl
vulnerabilities are usually exploited with UNIX delimiters
and commands. Since CGI/HTML/PHTML can house 
programs from many different types of underlying lan- 
guages, they encompass a wide range of exploitation 
techniques. The collected data contains raw traces of the 
interactions seen when downloading the pages for each 
of the returned results. Our corpus contained 178,541 
TCP flows, of which we randomly selected 24,000 flows 
as training data for our real-world deployment (see Sec- 
tion 6.1). 

Since our primary goal here is to detect (and catch)
bots using search engines to query strings present in
vulnerable web applications, our responder must be in a
position to capture this prey; i.e., it has to be broadly
indexed by multiple search engines. To do so, we first
created links to our responder from popular pages,⁶ and
then expedited the indexing process by disclosing the
existence of a minor bug in a common UNIX
application to the Full-Disclosure mailing list. The bug we
disclosed cannot be leveraged for privilege escalation.
Bulletins from Full-Disclosure are mirrored on several
high-ranking websites and are crawled extensively by
search-engine spiders; within a few hours, our
site appeared in search results on two prominent search
engines. And, right on cue, the attacks immediately
followed.


6.1 Real-World Deployment 


For our real-world evaluation, we deployed our system
on a 3.0 GHz dual-processor Intel Xeon with 8 GB of
RAM. At runtime, memory utilization peaked at 960 MB
of RAM when trained with 24,000 flows. CPU
utilization remained at negligible levels throughout operation
and, on average, requests were satisfied in less than a
second. Because our design was purposely optimized to
keep all data in RAM during runtime, disk access was
unnecessary.

Shortly after becoming indexed, search-worms began 
to attack at an alarming rate, with the attacks rapidly 
increasing over a two-month deployment period.
During that time, we also recorded the number of indexes
returned by Google per day (which totaled just shy of
12,000 during the deployment). We chose to show only
PHP attacks because of their prominence. Figure 7 de- 
picts the number of attacks we observed per day. For 
reference, we provide annotations of our Google index 
count in ten day intervals until the indices plateau. 
Figure 7: Daily PHP attacks. The valley on day 44 is due
to an 8-hour power outage. The peak on day 56 is because
two bots launched over 2,000 unique script attacks.


For ease of exposition, we categorize the observed at- 
tacks into four groups. The first denotes the number of at- 
tacks targeting vulnerabilities that have distinct file struc- 
tures in their names. The class “Unique PHP attacks”, 
however, is more refined and represents the number of at- 
tacks against scripts but using unique injection variables 
(i.e., index.php?page= and index.php?inc=). 
The reason we do so is that the file names and struc- 
tures can be ubiquitous and so by including the vari- 
able names we glean insights into attacks against poten- 
tially distinct vulnerabilities. We also attempt to quan- 
tify the number of distinct botnets involved in these at- 
tacks. While many botnets attack the same applica- 
tion vulnerabilities, (presumably) these botnets can be 
differentiated by the PHP script(s) they remotely
include. Recall that a typical PHP remote-include exploit is
of the form
“vulnerable.php?variable=http://site.com/attack-script?”,
and in practice,
botnets tend to use disjoint sites to store attack scripts.
Therefore, we associate bots with a particular botnet by 
identifying unique injection script repositories. Based on 
this admittedly loose notion of uniqueness [27], we ob- 
served attacks from 5,648 distinct botnets. Lastly, we 
record the number of unique IP addresses that attempt to 
compromise our responder. 
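
The repository-based grouping can be sketched as follows. This is only an illustration of the idea: it assumes that a botnet is identified by the host name of any URL passed as a query-string value, and the function name and sample log lines (apart from the exploit form quoted above) are hypothetical.

```python
from urllib.parse import parse_qsl, urlsplit


def injected_repositories(request_target: str):
    """Return (variable, host) for every URL passed as a query-string value."""
    query = urlsplit(request_target).query
    hits = []
    for name, value in parse_qsl(query, keep_blank_values=True):
        target = urlsplit(value)
        if target.scheme in ("http", "https", "ftp") and target.hostname:
            hits.append((name, target.hostname))
    return hits


log = [
    "/vulnerable.php?variable=http://site.com/attack-script?",
    "/index.php?page=http%3A%2F%2Fevil.example%2Fx.txt%3F",
    "/index.php?inc=http://site.com/other?",
    "/index.php?id=5",
]
# Distinct repositories stand in for distinct botnets.
repositories = {host for line in log for _, host in injected_repositories(line)}
# Distinct (script, variable) pairs correspond to "unique PHP attacks".
unique_attacks = {
    (urlsplit(line).path, name)
    for line in log
    for name, _ in injected_repositories(line)
}
```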

The results are shown in Figure 7. An immediate
observation is the sheer volume of attacks: in total, well
over 368,000 attacks targeting just under 45,000 unique
scripts before we shut down the responder.
Interestingly, notice that there are more unique PHP attacks than
unique IPs, suggesting that unlike traditional scanning
attacks, these bots query for and attack a wide variety
of web applications. Moreover, while many bots attempt
to exploit a large number of vulnerabilities, the
repositories hosting the injected scripts remain unchanged
from attack to attack. The range of attacks is perhaps
better demonstrated not by the number of unique PHP
scripts attacked but by the number of unique PHP
web-applications that are the target of these attacks.


6.1.1 Unique WebApps 


In general, classifying the number of unique web ap- 
plications being attacked is difficult because some bots 
target PHP scripts whose filenames are ubiquitous (e.g., 
index.php). In these cases, bots are either targeting 
a vulnerability in one specific web-application that hap- 
pens to use a common filename or arbitrarily attempting 
to include remote PHP scripts. 

To determine if an attack can be linked to a specific 
web-application, we downloaded the directory structures 
for over 4,000 web-applications from SourceForge.net. 
From these directory structures, we matched the web 
application to the corresponding attacked script (e.g., 
gallery.php might appear only in the Web Gallery 
web application). Next, we associated an attack with 
a specific web application if the file name appeared in 
no more than 10 web-app file structures. We chose a
threshold of 10 since SourceForge stores several copies 
of essentially the same web application under different 
names (due to, for instance, “skin” changes or different 
code maintainers). For non-experimental deployments 
aimed at detecting zero-day attacks, training data could 
be associated with its application of origin, thereby mak- 
ing associations between non-generic attacks and spe- 
cific web-applications straightforward. 
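
The file-name heuristic can be sketched as an inverted index from file names to the applications that ship them. The sketch assumes that matching is done on the final path component only; the function names and the toy application listing are hypothetical.

```python
from collections import defaultdict


def build_index(app_files):
    """app_files: {application name: iterable of file paths}.

    Returns a mapping from file name to the set of applications shipping it.
    """
    index = defaultdict(set)
    for app, paths in app_files.items():
        for path in paths:
            index[path.rsplit("/", 1)[-1]].add(app)
    return index


def candidate_apps(script_path, index, threshold=10):
    """Applications linked to an attacked script, or an empty set when the
    file name is unknown or appears in more than `threshold` applications."""
    name = script_path.rsplit("/", 1)[-1]
    apps = index.get(name, set())
    return apps if 0 < len(apps) <= threshold else set()


# Toy listing: one distinctive file name and one ubiquitous file name.
app_files = {"Web Gallery": ["gallery.php", "index.php"]}
app_files.update({f"app{i}": ["src/index.php"] for i in range(11)})
index = build_index(app_files)
```

With this listing, an attack on `/decoder/gallery.php` maps to a single application, whereas an attack on `index.php` is left unattributed because the name appears in more than ten file structures.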

Based on this heuristic, we are able to map the 24,000 
flows we initially trained on to 560 “unique” web- 
applications. Said another way, by simply building our 
language models on randomly chosen flows, we were 
able to generate content that approximates 560 distinct 
web-applications — a feat that is not as easy to achieve 
if we were to deploy each application on a typical web- 
based honeypot (e.g., the Google Hack Honeypot [3]). 
The attacks themselves were linked back to 295 distinct 
web applications, which is indicative of the diversity of 
attacks. 

We note that our heuristic to map content to web- 
apps is strictly a lower bound as it only identifies web- 
applications that have a distinct directory structure and/or 
file name; a large percentage of web-applications use 
index.php and other ubiquitous names and are there- 
fore not accounted for. Nonetheless, we believe this 
serves to make the point that our approach is effective 
and easily deployable, and moreover, provides insight 
into the number of web-application vulnerabilities
currently being leveraged by botnets.
6.1.2 Spotting Emergent Threats


While the original intention of our deployment was to
elicit interaction from malware exploiting known
vulnerabilities in web applications, we became indexed under
broader conditions due to the high degree of
variability in our training data. As a result, a honeypot or
active responder indexed under such a broad set of web
applications can, in fact, attract attacks targeting unknown
vulnerabilities. For instance, according to milw0rm (a
popular security advisory/exploit distribution site), over
65 PHP remote inclusion vulnerabilities were released
during our two-month deployment [1]. Our deployment
began on October 27th, 2007 and used the same training
data for its entire duration. Hence, any attack exploiting
a vulnerability released after October 27th is an attack
we did not explicitly set out to detect.

Nonetheless, we witnessed several emergent threats 
(some may even consider them “zero-day” attacks) be- 
cause some of the original queries used to bootstrap 
training were generic and happened to represent a wide 
number of webapps. As of this writing, we have iden- 
tified more than 10 attacks against vulnerabilities that 
were undisclosed at deployment time (some examples 
are illustrated in Table 2). It is unlikely that we wit- 
nessed these attacks simply because of arbitrary attempts 
to exploit random websites—indeed, we never witnessed 
many of the other disclosed vulnerabilities being at- 
tacked. 

We argue that given the frequency with which these 
types of vulnerabilities are released, a honeypot or an ac- 
tive responder without dynamic content generation will 
likely miss an overwhelming amount of attack traffic—in 
the attacks we witnessed, botnets began attacking
vulnerable applications on the day the vulnerability was
publicly disclosed! An even more compelling case for our
architecture is embodied by attacks against vulnerabili- 
ties that have not been disclosed (e.g., the recent Word- 
Press vulnerability [7]). We believe that the potential to 
identify these attacks exemplifies the real promise of our 
approach. 


6.2 Dissecting the Captured Payloads 


To better understand what the post-infection process en- 
tails, we conducted a rudimentary analysis of the re- 
motely included PHP scripts. Our malware analysis was 
performed on a Linux based Intel virtual machine with 
the 2.4.7 kernel. We used a deprecated kernel version 
since newer versions do not export the system call ta- 
ble of which we take advantage. Our environment con- 
sisted of a kernel module and a preloaded library’ that 
serve to inoculate malware before execution and to log 
interesting behavior. The preloaded library captures calls 
Disclosure Date | Attack Date | Signature
2007-11-04      | 2007-11-10  | /starnet/themes/c-sky/main.inc.php?cmsdir=
2007-11-21      | 2007-11-23  | /comments-display-tpl.php?language_file=
2007-11-22      | 2007-11-22  | /admin/kfm/initialise.php?kfmbase_path=
2007-11-25      | 2007-11-25  | /Commence/includes/db_connect.php?phproot_path=
2007-11-28      | 2007-11-28  | /decoder/gallery.php?ccms_library_path=

Table 2: Attacks targeting vulnerabilities that were unknown at time of deployment


to connect() and send(). The connect hook deceives
the malware by faking successful connections, and
the send hook allows us to record information transmitted
over sockets.⁸


Our kernel module hooks three system calls: open,
write, and execve. We execute every script under
a predefined user ID, and interactions under this ID are
recorded via the open() hook. We also disallow calls to
open that request write access to a file, but feign success
by returning a special file descriptor. Attempts to write
to this file descriptor are logged via syslog. Doing so
allows us to record files written by the malware without
allowing it to actually modify the file system. Similarly,
only commands whose file names contain a pre-defined
random password are allowed to execute. All other command
executions under the user ID fail (but pretend to
succeed), ensuring that no malicious commands
execute. Returning success from failed executions is important
because a script may, for example, check if a
command (e.g., wget) successfully executes before requesting
the target URL.
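
The hook policy described above amounts to a small decision procedure. The following Python sketch is only a user-space model of that logic, not the kernel module itself; the class name `SandboxPolicy`, the sentinel `FAKE_FD`, the return placeholders, and the in-memory log (standing in for syslog) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

FAKE_FD = -42  # hypothetical sentinel standing in for the "special file descriptor"


@dataclass
class SandboxPolicy:
    """User-space model of the open/write/execve hooks (illustrative only)."""
    password: str                       # assumed random token in allowed command names
    log: list = field(default_factory=list)

    def on_open(self, path: str, wants_write: bool):
        # Every open issued under the sandboxed user ID is recorded.
        self.log.append(("open", path, wants_write))
        if wants_write:
            # Refuse the real open, but feign success with the special descriptor.
            return FAKE_FD
        return "real-open"              # placeholder: pass through to the real call

    def on_write(self, fd, data: bytes):
        if fd == FAKE_FD:
            # Record what the malware tried to write; nothing reaches the disk.
            self.log.append(("write", data))
            return len(data)            # pretend all bytes were written
        return "real-write"             # placeholder: pass through to the real call

    def on_execve(self, filename: str):
        self.log.append(("execve", filename))
        if self.password in filename:
            return "real-exec"          # only commands carrying the password run
        return 0                        # report success without executing anything
```

Returning a success value from the blocked `execve` path mirrors the point made in the text: a script that checks whether a command ran before continuing will keep going, so its later behavior is still observed.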


To determine the functionality of the individual malware
scripts, we batch-processed all the captured malware
on the aforementioned architecture. From the transcripts
provided by the kernel module and library, we
were able to discern basic functionality, such as whether 
or not the script makes connections, issues IRC com- 
mands, attempts to write files, etc. In certain cases, we 
also conducted more in-depth analyses by hand to un- 
cover seemingly more complex functionality. We discuss 
our findings in more detail below. 
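
As a rough illustration of how basic functionality might be read off such transcripts, the sketch below derives a few behavior flags from a list of log records. The record format (pairs of event name and detail string) and the function name `summarize_transcript` are assumptions for this example and are not taken from the tooling described above.

```python
IRC_COMMANDS = ("NICK ", "USER ", "JOIN ", "PRIVMSG ")  # common IRC client commands


def summarize_transcript(events):
    """Derive coarse behavior flags from a list of (event, detail) log records.

    The record format is an assumption for this sketch: event is one of
    "connect", "send", "write", "execve"; detail is a string.
    """
    flags = {
        "connects": False,
        "irc": False,
        "writes_files": False,
        "runs_commands": False,
    }
    for event, detail in events:
        if event == "connect":
            flags["connects"] = True
        elif event == "send" and detail.startswith(IRC_COMMANDS):
            # Data sent over a socket that begins with an IRC command.
            flags["irc"] = True
        elif event == "write":
            flags["writes_files"] = True
        elif event == "execve":
            flags["runs_commands"] = True
    return flags
```

Flags of this kind only support the coarse grouping; the more complex functionality mentioned in the text would still require inspection by hand.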


The high-level break-down for the observed scripts is 
given in Table 3. The challenge in capturing bot pay- 
loads in web application attacks stems from the ease with 
which the attacker can test for a vulnerability; unique
string displays (where the malware echoes a unique token
in the response to signify successful exploitation)
account for the most prevalent type of injection. Typically,
bots parse returned responses for their identifying
token and, if found, proceed to inject the actual bot pay- 
load. Since these unique tokens are unlikely to appear 
in our generated response, we augment our responder to 
echo these tokens at run-time. While the use of random
numbers as tokens seems to be the soup du jour for testing
Script Classification     Instances
PHP Web-based Shells      834
Echo Notification         591
PHP Bots                  377
Spammers                  347
Downloaders               182
Perl Bots                 136
Email Notification        87
Text Injection            35
Java-script Injection     18
Information Farming       9
Uploaders                 4
Image Injection           4
UDP Flooders              3

Table 3: Observed instances of individual malware


a vulnerability, we observed several instances where at- 
tackers injected an image. Somewhat comically, in many 
cases, the bot simply e-mails the IP address of the vulner- 
able machine, which the attacker then attempts to exploit 
at a later time. The least common vulnerability test we 
observed used a connect-back operation to connect to an 
attacker-controlled system and send vulnerability infor- 
mation to the attacker. This information is presumably 
logged server-side for later use. 

Interestingly, we notice that bots will often inject sim- 
ple text files that typically also contain a unique identi- 
fying string. Because PHP scripts can be embedded in- 
side HTML, PHP requires begin and end markers. When 
a text file is injected without these markers, its contents 
are simply interpreted as HTML and displayed in the out- 
put. This by itself is not particularly interesting, but we 
observed several attackers injecting large lists of queries 
to find vulnerable web applications via search engines. 
The largest query list we captured contained 7,890 search 
queries that appear to identify vulnerable web applica- 
tions — all of which could be used to bootstrap our con- 
tent generation further and cast an even wider net. 
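
The marker behavior described above can be illustrated with a short sketch that returns the portion of an included file PHP would emit verbatim. It is deliberately simplified: only the full `<?php ... ?>` tag form is recognized (short tags and the newline PHP swallows after a closing tag are ignored), and the function name `literal_output` is an assumption for this example.

```python
import re

# A PHP block runs from "<?php" to the next "?>", or to end of file if unclosed.
PHP_BLOCK = re.compile(r"<\?php(.*?)(?:\?>|\Z)", re.DOTALL)


def literal_output(payload: str) -> str:
    """Return the parts of an included file that fall outside PHP markers.

    Simplification: only the full "<?php ... ?>" tag form is recognized.
    """
    return PHP_BLOCK.sub("", payload)
```

An injected list of search queries contains no markers at all, so the whole file falls into the literal part and appears in the response.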

Overall, the collected malware was surprisingly modular
and offered diverse functionality similar to that reported
elsewhere [26, 15, 13, 12, 25, 5, 4]. The captured
scripts (mostly PHP-based command shells) are
advanced enough that many have the ability to display
the output in some user-friendly graphical user interface,
obfuscate the script itself, clean the logs, erase the script 
and related evidence, deface a site, crawl vulnerability 
sites, perform distributed denial of service attacks and 
even perform automatic self-updates. In some cases, the 
malware inserted tracking cookies and/or attempted to 
gain more information about a system’s inner-workings 
(e.g., by copying /etc/passwd and performing local 
banner scans). To our surprise, only eight scripts contained
functionality to automatically obtain root. In
these cases, they all used C-based kernel exploits
that are written to disk and compiled upon exploitation.
Lastly, IRC was used almost exclusively as the
communication medium. As can be expected, we also 
observed several instances of spamming malware us- 
ing e-mail addresses pulled from the web-application’s 
MySQL database backend. In a system like phpBB, 
this can be highly effective because most forum users 
enter an e-mail address during the registration process. 
Cross-checking the bot IPs with data from the Spamhaus 
project [2] shows that roughly 36% of them currently ap- 
pear in the spam black list. 

One noteworthy functionality that seems to transcend 
our categorizations among PHP scripts is the ability to 
break out of PHP safe mode. PHP safe mode disables
functionality for, among other things, executing system
commands and modifying the file system. The malware we
observed that bypasses safe mode tends to contain a handful
of known exploits that target functionality
in PHP, in MySQL, or in web
server software. Lastly, we note that although we observed
what appeared to be over 5,648 unique injection
scripts from distinct botnets, nearly half of them point 
to zombie botnets. These botnets no longer have a cen- 
tralized control mechanism and the remotely included 
scripts are no longer accessible. However, they are still 
responsible for an overwhelming amount of our observed 
HTTP traffic. 


6.3 Limitations 


One might argue that a considerably less complex (but 
more mundane) approach for eliciting search worm traf- 
fic may be to generate large static pages that con- 
tain content representative of a variety of popular web- 
applications. However, simply returning arbitrary or 
static pages does not yield either the volume or diver- 
sity of attacks we observed. For instance, one of our 
departmental websites (with a much higher PageRank 
than our deployment site) only witnessed 437 similar at- 
tacks since August 2006. As we showed in Section 6, 
we witnessed well over 368,000 attacks in just over two 
months. Moreover, close inspection of the attacks on the
university website shows that they are far less varied or
interesting. These attacks seem to originate from either
a few botnets that issue “loose” search queries (e.g.,
“inurl:index.php”) and subsequently inject their attack, or
simply attack ubiquitous file names with common vari- 
able names. Not surprisingly, these unsophisticated bot- 
nets are less widespread, most likely because they fail 
to infect many hosts. By contrast, the success of our
approach led to more insightful observations about the
scope and diversity of attacks because we were able to 
cast a far wider net. 


That said, for real-world honeypot deployments, de- 
tection and exploitation of the honeypot itself can be a 
concern. Clearly, our system is not a true web-server 
and like other honeypots [23], it too can be trivially de- 
tected using various fingerprinting techniques [14]. More 
to the point, a well-crafted bot that knows that a partic- 
ular string always appears in pages returned by a given 
web-application could simply request the page from us 
and check for the presence of that string. Since we will 
likely fail to produce that string, our phony will be
detected.⁹


The fact that our web-honeypot can be detected is a 
clear limitation of our approach, but in practice it has not 
hindered our efforts to characterize current attack trends, 
for several reasons. First, the search worms we witnessed 
all seemed to use search engines to find the identifying
information of a web-application, and attacked the vulnerability
upon the first visit to the site, presumably because
verifying that the response contains the expected
string slows down infection. Moreover, it is often
difficult to discern the web-application of origin as many
web-applications do not necessarily contain strings that 
uniquely identify the software. Indeed, in our own analy- 
sis, we often had difficulty identifying the targeted web- 
application by hand, and so automating this might not be 
trivial. 


Lastly, we argue that the limitations of the approach
proposed herein manifest themselves as trade-offs. Our
decision to design a stateless system results in a memory- 
efficient and lightweight deployment. However, this de- 
sign choice also makes handling stateful protocols nearly 
impossible. It is conceivable that one can convert our 
architecture to better interact with stateful protocols by 
simply changing some aspects of the design. For in- 
stance, this could be accomplished by incorporating flow 
sequence information into training and then recalling its 
hierarchy during generation (e.g., by generating a re- 
sponse from the set of appropriate first round responses, 
then second round responses, etc.). To capture multi-stage
attacks, however, ScriptGen [18, 17] may be a better
choice for emulating the protocol interaction,
and can be used in conjunction with our technique to cast
a wider net to initially entice such malware.


7 Conclusion


In this paper, we use a number of multi-disciplinary tech- 
niques to generate dynamic responses to protocol in- 
teractions. We demonstrate the utility of our approach 
through the deployment of a dynamic content generation 
system targeted at eliciting attacks against web-based 
exploits. During a two-month period we witnessed an
unrelenting barrage of attacks from attackers that scour 
search engine results to find victims (in this case, vulner- 
able web applications). The attacks were targeted at a 
diverse set of web applications, and employed a myriad 
of injection techniques. We believe that the results herein 
provide valuable insights on the nature and scope of this 
increasing Internet threat. 

... information on how to get access to this data, please see
http://spar.isi.jhu.edu/botnetdata/.


Notes


¹ The drawback, of course, is that high-interaction honeypots are
a heavy-weight solution, and risk creating their own security
problems [23].

² Protocol messages are tokenized similarly in [18, 17] and [8].

³ In practice, we use printable and non-printable.

⁴ The results are virtually the same for nslookup, and hence
omitted.

⁵ These initial queries were provided by one of the authors, but
similar results could easily be achieved by crawling the WebApp directories
in SourceForge and searching Google for identifiable strings (similar to
what we outline in Section 6.1.1).

⁶ We placed links on 3 pages with Google PageRank ranking of 6, 2
pages with rank 5, 3 pages with rank 2, and 5 pages with rank 0.

⁷ A preloaded library loads before all other libraries in order to hook
certain library functions.

⁸ Because none of the malware we obtained use direct system calls
to either connect() or send(), this setup suffices for our needs.

⁹ Notice, however, that if a botnet has n bots conducting an attack
against a particular web-application, we only need to probabilistically
return what the malware is seeking 1/n-th of the time to capture the
malicious payload.
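
The argument in the last note can be made concrete with a short calculation. The sketch below assumes that each of the n bots probes once and that each probe independently receives the expected reply with probability 1/n; these independence assumptions, and the function name `capture_probability`, are ours and not part of the note.

```python
def capture_probability(n_bots: int) -> float:
    """Probability that at least one of n independent probes gets the expected
    reply, when each probe is answered correctly with probability 1/n."""
    return 1.0 - (1.0 - 1.0 / n_bots) ** n_bots


for n in (1, 2, 10, 1000):
    print(n, round(capture_probability(n), 3))
```

Under these assumptions the expected number of successful replies is exactly one, and the chance of at least one success stays above 1 - 1/e (about 0.63) for any n.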


Abstract
The security of embedded devices often relies on the secrecy of proprietary cryptographic algorithms. These 
algorithms and their weaknesses are frequently disclosed through reverse-engineering software, but it is 
commonly thought to be too expensive to reconstruct designs from a hardware implementation alone. This 
paper challenges that belief by presenting an approach to reverse-engineering a cipher from a silicon imple- 
mentation. Using this mostly automated approach, we reveal a cipher from an RFID tag that is not known 
to have a software or micro-code implementation. We reconstruct the cipher from the widely used Mifare 
Classic RFID tag by using a combination of image analysis of circuits and protocol analysis. Our analysis re- 
veals that the security of the tag is even below the level that its 48-bit key length suggests due to a number of 
design flaws. Weak random numbers and a weakness in the authentication protocol allow for pre-computed 
rainbow tables to be used to find any key in a matter of seconds. Our approach of deducing functionality
from circuit images is mostly automated, hence it is also feasible for large chips. The assumption that
algorithms can be kept secret should therefore be avoided for any type of silicon chip.


Il faut qu’il n’exige pas le secret, et qu’il puisse sans inconvénient tomber entre les mains de l’ennemi. 
([A cipher] must not depend on secrecy, and it must not matter if it falls into enemy hands.)
August Kerckhoffs, La Cryptographie Militaire, January 1883 [13] 


1 Introduction 


It has long been recognized that security-through-obscur- 
ity does not work. However, vendors continue to be- 
lieve that if an encryption algorithm is released only as 
a hardware implementation, then reverse-engineering the 
cipher from hardware alone is beyond the capabilities of 
likely adversaries with limited funding and time. The 
design of the cipher analyzed in this paper, for example, 
had not been disclosed for 14 years despite more than a 
billion shipped units. We demonstrate that the cost of re- 
verse engineering a cipher from a silicon implementation 
is far lower than previously thought. 


In some cases, details of an unknown cryptographic ci- 
pher may be found by analyzing the inputs and outputs 
of a black-box implementation. Notable examples in- 
clude Bletchley Park’s breaking the Lorenz cipher during 
World War II without ever acquiring a cipher machine 
[5] and the disclosure of the DST cipher used in cryp- 
tographic Radio Frequency Identification (RFID) tokens 
from Texas Instruments [4]. In both cases, researchers 
started with a rough understanding of the cipher’s structure
and were able to fill in the missing details through
cryptanalysis of the cipher output for known keys and 
inputs. This black-box approach requires some prior un- 
derstanding of the structure of a cipher and is only appli- 
cable to ciphers with statistical weaknesses. The output 
of a sound cipher should not be statistically biased and 
therefore should not leak information about its structure. 


Other ciphers have been disclosed through disassem- 
bly of their software implementation. Such implemen- 
tations can either be found in computer software or as 
microcode on an embedded micro-controller. Ciphers 
found through software disassembly include the A5/1 
and A5/2 algorithms that secure GSM cell phone com- 
munication [1] and the Hitag2 and Keeloq algorithms 
used in car remote controls [3]. The cryptography on 
the RFID tags we analyzed is not known to be available 
in software or in a micro-code implementation; tags and 
reader chips implement the cipher entirely in hardware. 


In this paper, we focus on revealing proprietary cryptog- 
raphy from its silicon implementation alone. Reverse- 
engineering silicon is possible even when very little is 
known about a cipher and no software implementation 
exists. The idea of reverse-engineering hardware is not
new. Hardware analysis is frequently applied in indus- 
try, government, and the military for spying, security as- 
sessments, and protection of intellectual property. Such 
reverse-engineering, however, is usually considered pro- 
hibitively expensive for typical attackers, because of the 
high prices charged by professionals offering this ser- 
vice. The key contribution of this work is demonstrating 
that reverse-engineering silicon is cheap and that it can 
be mostly automated. This is the first published work 
to describe the details of reverse-engineering a crypto- 
graphic function from its silicon implementation. We 
describe a mostly automated process that can be used 
to cheaply determine the functionality of previously un- 
known cipher implementations. 


We demonstrate the feasibility of our approach by reveal- 
ing the cipher implemented on the NXP Mifare Clas- 
sic RFID tags, the world’s most widely used crypto- 
graphic RFID tag [16]. Section 2 describes our reverse- 
engineering method and presents the cipher. Section 3 
discusses several weaknesses in the cipher beyond its 
short key size. Weak random numbers combined with 
a protocol flaw allow for rainbow tables to be computed 
that reduce the attack time from weeks to minutes. Sec- 
tion 4 discusses some potential improvements and de- 
fenses. While we identify fixes that would increase the 
security of the Mifare cipher significantly, we conclude 
that good security may be hard to achieve within the de- 
sired resource constraints. 


2 Mifare Crypto-1 Cipher 


We analyzed the Mifare Classic RFID tag by NXP (for- 
merly Philips). This tag has been on the market for over 
a decade with over a billion units sold. The Mifare Clas- 
sic card is frequently found in access control systems and 
tickets for public transport. Large deployments include 
the Oyster card in London, and the SmartRider card in 
Australia. Before this work, the Netherlands were plan- 
ning to deploy Mifare tags in OV-chipkaart, a nation- 
wide ticketing system, but the system will likely be re- 
engineered after first news about a potential disclosure of 
the card’s details surfaced [17]. The Mifare Classic chip 
currently sells for 0.5 Euro in small quantities, while tags 
with larger keys and established ciphers such as 3-DES 
are at least twice as expensive. 


The cryptography found in the Mifare cards is a stream 
cipher with 48-bit symmetric keys. This key length has 
been considered insecure for some time (for example, the 
Electronic Frontier Foundation’s DES cracking machine 
demonstrated back in 1998 that a moderately funded attacker
could brute force 56-bit DES [6]) and the practical
security that Mifare cards have experienced in the past 
relies primarily on the belief that its cipher was secret. 
We find that the security of the Mifare Classic is even 
weaker than the short key length suggests due to flaws in 
its random number generation and the initialization pro- 
tocol discussed in Section 3. 


The data on the Mifare cards is divided into sectors, each 
of which holds two different keys that may have different 
access rights (e.g., read/write or read-only). This division 
allows for different applications to each store encrypted 
data on a tag—an option rarely used in practice. All se- 
crets are set to default values at manufacturing time but 
changed before issuing the tags to users. Different tags 
in a system may share the same read key or have dif- 
ferent keys. Sharing read keys minimizes the overhead 
of key-distribution to offline readers. We find, however, 
that the protocol level measures meant to prevent differ- 
ent users from impersonating each other are insufficient. 
Unique read and write keys should, therefore, be used for 
each tag and offline readers should be avoided as much 
as possible. 


2.1 Hardware Analysis 


The chip on the Mifare Classic tag is very small with 
a total area of roughly one square millimeter. About a 
quarter of the area is used for 1K of flash memory (a 
4K version is also available); another quarter is occupied 
by the radio front-end and outside connectivity, leaving 
about half the chip area for digital logic including cryp- 
tography. 


The cryptography functions make up about 400 2-NAND 
gate equivalents (GE), which is very small even com- 
pared to highly optimized implementations of standard 
cryptography. For example, the smallest known imple- 
mentation of the AES block cipher (which was specif- 
ically designed for RFID tags) requires 3400 GEs [7]. 
The cryptography on the Mifare tags is also very fast and 
outputs 1 bit of key stream in every clock cycle. The AES
circuit, by comparison, takes 1000 clock cycles for one 
128-bit AES operation (10 milliseconds on a tag running 
at 106 kHz). 
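
The timing comparison above can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text (1000 cycles per AES block, a 106 kHz clock, one key stream bit per cycle); the per-cycle throughput ratio is a derived quantity that the text does not state.

```python
CLOCK_HZ = 106_000         # tag clock rate quoted in the text
AES_CYCLES = 1000          # cycles per 128-bit AES operation quoted in the text

aes_seconds = AES_CYCLES / CLOCK_HZ        # time for one AES block
aes_bits_per_cycle = 128 / AES_CYCLES      # average AES output per clock cycle
crypto1_bits_per_cycle = 1                 # one key stream bit per clock cycle

print(f"AES block time: {aes_seconds * 1e3:.1f} ms")
print(f"Throughput ratio: {crypto1_bits_per_cycle / aes_bits_per_cycle:.1f}x")
```

The block time comes out slightly under the rounded 10 milliseconds given in the text.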


To reverse engineer the cryptography, we first had to get 
access to sample chips, which are usually embedded in 
credit card size plastic cards. We used acetone to dis- 
solve the plastic card, leaving only the blank chips. Ace- 
tone is easier and safer to handle than alternatives such 
as fuming nitric acid, but still dissolves plastic cards in 
Figure 1: (a) Source image of layer 2 after edge detection; (b) after automated template detection.


about half an hour. Once we had isolated the silicon 
chips, we removed each successive layer through me- 
chanical polishing, which we found easier to control than 
chemical etching. Simple polishing emulsion or sandpaper
with very fine grading of 0.04 suffices to take off
micrometer-thick layers within minutes. 


Although the polishing is mostly straightforward, the one 
obstacle to overcome is the chip tilting. Since the chip 
layers are very close together, even the smallest tilt leads 
to cuts through several layers. We addressed this problem 
in two ways. First, we embedded the millimeter-size chip 
in a block of plastic so it was easier to handle. Second, 
we accepted that we could not completely avoid tilt using
our simple equipment and adapted our image stitching 
tools to patch together chip layers from several sets of 
pictures, each imaging parts of several layers. 


The chip contains a total of six layers, the lowest of 
which holds the transistors. We took pictures using a 
standard optical microscope at a magnification of 500x. 
From multiple sets of these images we were able to au- 
tomatically generate images of each layer using tech- 
niques for image tiling that we borrowed from panorama 
photography. We achieved the best results using the 
open source tool hugin (http://hugin.sourceforge.net/) 
by setting the maximum variance in viewer angle to a 
very small value (e.g., 0.1°) and manually setting a few 
control points on each image. 

The transistors are grouped in gates that each perform
a logic function such as AND, XOR, or flip-flop as il- 
lustrated in Figure 1. Across the chip there are several 
thousand such logic gates, but only about 70 different 
types of gates. As a first step toward reconstructing the 
circuit, we built a library of these gates. We implemented 
template matching that, given one instance of a logic gate,
finds all the other instances of the same gate across the
chip. Our tools take as input an image of layer 2, which 
represents the logic level, and the position of instances 
of different logic gates in the image. The tools then use 
template matching to find all other instances of the gate 
across the image, including rotated and mirrored vari- 
ants. Since larger gates sometimes contain smaller gates 
as building blocks, the matching is done in order of de- 
creasing gate sizes. 


Our template matching is based on normalized cross-
correlation, a well-known similarity test [14],
and is implemented using the MATLAB image processing
library. Computing this metric is computationally
more complex than standard cross-correlation, but the 
total running time of our template matching is still un- 
der ten minutes for the whole chip. Normalized cross- 
correlation is insensitive to the varying brightness across 
our different images and the template matching is able 
to find matches with high accuracy despite varying col- 
oration and distortion of the structures that were caused 
by the polishing. 
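
To make the matching step concrete, the following sketch computes normalized cross-correlation for every placement of a template and also tries rotated and mirrored variants. It is a plain NumPy stand-in written for illustration, not the MATLAB implementation described above; the function names `ncc_map` and `find_gates` and the 0.9 threshold are assumptions.

```python
import numpy as np


def ncc_map(image: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Normalized cross-correlation score for every placement of template."""
    th, tw = template.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    out_h = image.shape[0] - th + 1
    out_w = image.shape[1] - tw + 1
    scores = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            patch = image[y:y + th, x:x + tw]
            p = patch - patch.mean()          # removes local brightness offset
            denom = np.sqrt((p ** 2).sum()) * t_norm
            if denom > 0:                     # flat patches keep a score of 0
                scores[y, x] = (p * t).sum() / denom
    return scores


def find_gates(image, template, threshold=0.9):
    """Return (y, x) positions scoring at least threshold for the template
    or any of its rotated and mirrored variants."""
    hits = set()
    for flipped in (template, np.fliplr(template)):
        for k in range(4):
            variant = np.rot90(flipped, k)
            ys, xs = np.where(ncc_map(image, variant) >= threshold)
            hits.update(zip(ys.tolist(), xs.tolist()))
    return hits
```

Subtracting the mean and dividing by the norms is what makes the score unaffected by uniform brightness and contrast changes between images. The explicit loops keep the sketch readable but would be too slow for whole-chip images.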

We then manually annotated each type of gate with its
respective functionality. This step could be automated 
as well through converting the silicon-level depiction of 
each gate into a format suitable for a circuit simulation 
program. We decided against this approach because the 
overhead seemed excessive. For larger libraries that per- 
haps intentionally vary the library cells in an attempt 
to impede reverse-engineering, however, automation is 
certainly possible and has already been demonstrated in 
other projects [2]. 


Our template matching provides a map of the different 
logic gates across the chip. While it would certainly have 
been possible to reverse-engineer the whole RFID tag, 
we focused our attention on finding and reconstructing 
the cryptographic components. We knew that the stream 
cipher would have to include at least a 48-bit register and 
a number of XOR gates. We found these components in 
one of the corners of the chip along with a circuit that 
appeared to be a random number generator as it has an 
output, but no input. 


Focusing our efforts on only these two parts of the chip, 
we reconstructed the connections between all the logic 
gates. This step involved considerable manual effort and 
was fairly error-prone. All the errors we made were 
found through a combination of redundant checking and 
statistical tests for some properties that we expected the
cipher to have such as an even output distribution of 
blocks in the filter function. We have since implemented 
scripts to automate the detection of wires, which can 
speed the process and improve its accuracy. Using our 
manually found connections as ground truth we find that 
our automated scripts detect the metal connection and 
intra-layer vias correctly with reasonably high probabil- 
ity. In our current tests, our scripts detect over 95% of the 
metal connections correctly and the few errors they make 
were easily spotted manually by overlaying the source 
image and the detection result. These results are, how- 
ever, preliminary, as many factors are not yet accounted 
for. To assess the potential for automation more thor- 
oughly, we plan to test our tools on different chips, us- 
ing different imaging systems, and having different users 
check the results. 


In the process of reconstructing the circuit, we did not 
encounter any added obscurity or tamper-proofing. Be- 
cause the cryptographic components are highly struc- 
tured, they were particularly easy to reconstruct. Fur- 
thermore, we could test the validity of different building 
blocks by checking certain statistical properties. For ex- 
ample, the different parts of the filter function each have 
an even output distribution so that the output bits are not 
directly disclosing information about single state bits. 
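
The even-output-distribution check mentioned above is easy to state in code: a reconstructed block is plausible if it outputs 0 and 1 equally often over all inputs. In the sketch below, `candidate_block` is a hypothetical 4-input function chosen only for illustration; it is not the Crypto-1 filter function, which this section does not spell out.

```python
from itertools import product


def is_balanced(func, n_inputs: int) -> bool:
    """True if func outputs 0 and 1 equally often over all n-bit inputs."""
    ones = sum(func(*bits) for bits in product((0, 1), repeat=n_inputs))
    return 2 * ones == 2 ** n_inputs


# Hypothetical 4-input candidate block, NOT the actual Crypto-1 filter function.
def candidate_block(a, b, c, d):
    return (a & b) ^ c ^ d


print(is_balanced(candidate_block, 4))
```

A wiring mistake in a reconstructed block will often break this balance, which is why such a test helps catch errors.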

The map of logic gates and the connections between
them provide us with almost enough information to discover
the cryptographic algorithm. Because we did not
reverse-engineer the control logic, we do not know the 
exact timing and inputs to the cipher. Instead of recon- 
structing more circuitry, we derived these missing pieces 
of information from protocol layer communication be- 
tween the Mifare card and reader. 


2.2 Protocol Analysis 


From the discovered hardware circuit, we could not de- 
rive which inputs are shifted into the cipher in what or- 
der, partly because we did not reverse the control logic, 
but also because even with complete knowledge of the 
hardware we would not yet have known what data differ- 
ent memory cells contain. To add the missing details to 
the cipher under consideration, and to verify the results 
of the hardware analysis, we examined communication 
between the Mifare tags and a Mifare reader chip. 


An NXP reader chip is included on the OpenPCD open 
source RFID reader, whose flexibility proved to be cru- 
cial for the success of our project. The OpenPCD in- 
cludes an ARM micro-controller that controls the com- 
munication between the NXP chip and the Mifare card. 
This setup allows us to record the communication and 
provides full control over the timing of the protocol. 
Through timing control we can amplify some of the vul- 
nerabilities we discovered as discussed in Section 3. 


No details of the cipher had been published by the manufacturer
or otherwise leaked to the public prior
to this work. We guessed that the secret key and the tag 
ID were shifted into the shift register sequentially rather 
than being combined in a more complicated way. To 
test this hypothesis, we checked whether a reader could 
successfully authenticate against a tag using an altered 
key and an altered ID. Starting with single bit changes 
in ID and key and progressively extending our search to 
larger variations, we found a number of such combina- 
tions that indeed successfully authenticated the reader to 
the tag. From the pattern of these combinations we could 
derive not just the order of inputs, but also the structure 
of the linear feedback shift register, which we had inde- 
pendently found on the circuit level. Combining these in- 
sights into the authentication protocol with the results of 
our hardware analysis gave us the whole Crypto-1 stream 
cipher, shown in Figure 2. 

The cipher is a single 48-bit linear feedback shift register 
(LFSR). From a fixed set of 20 state bits, one bit of
key stream is computed in every clock cycle. The shift 
register has 18 taps (shown as four downward arrows in 
Figure 2: Crypto-1 stream cipher and initialization.


the figure) that are linearly combined to fill the first reg- 
ister bit on each shift. The update function does not con- 
tain any non-linearity, which by today’s understanding of 
cipher design can be considered a serious weakness. The 
generating polynomial of the register is (with x^i being the
i-th bit of the shift register):


x^{48} + x^{43} + x^{39} + x^{38} + x^{36} + x^{34} + x^{33} + x^{31} + x^{29}
+ x^{24} + x^{23} + x^{21} + x^{19} + x^{13} + x^{9} + x^{7} + x^{6} + x^{5} + 1.


The polynomial is primitive in the sense that it is irreducible
and generates all (2^{48} - 1) possible outputs in
succession. To confirm this, we converted the Fibonacci
LFSR into a Galois LFSR for which we can compute any
number of steps in a few Galois field multiplications. We
then found that the cipher state repeats after (2^{48} - 1)
steps, but not after any of the possible factors for this
number. The LFSR is hence of maximum length.
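
A check of this kind can be sketched with polynomial arithmetic over GF(2): a degree-n polynomial is primitive exactly when x has multiplicative order 2^n - 1 modulo the polynomial. The code below is an illustrative reimplementation under that standard criterion, not the tooling used here; the helper names are assumptions, and the exponent list is our reading of the 18-tap generating polynomial given above.

```python
def gf2_mulmod(a: int, b: int, poly: int, deg: int) -> int:
    """Multiply polynomials a and b over GF(2), reduced modulo poly."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if (a >> deg) & 1:      # degree reached deg: subtract (XOR) the modulus
            a ^= poly
    return result


def gf2_powmod(base: int, exponent: int, poly: int, deg: int) -> int:
    """Square-and-multiply exponentiation in GF(2)[x] modulo poly."""
    result = 1
    while exponent:
        if exponent & 1:
            result = gf2_mulmod(result, base, poly, deg)
        base = gf2_mulmod(base, base, poly, deg)
        exponent >>= 1
    return result


def prime_factors(m: int) -> set:
    """Distinct prime factors of m by trial division."""
    factors, d = set(), 2
    while d * d <= m:
        while m % d == 0:
            factors.add(d)
            m //= d
        d += 1
    if m > 1:
        factors.add(m)
    return factors


def is_primitive(exponents) -> bool:
    """True if the GF(2) polynomial with the given exponents is primitive."""
    deg = max(exponents)
    poly = sum(1 << e for e in exponents)
    order = (1 << deg) - 1
    x = 2                                   # the polynomial "x"
    if gf2_powmod(x, order, poly, deg) != 1:
        return False                        # state does not repeat after 2^n - 1
    # The period must not divide any proper factor (2^n - 1) / q.
    return all(gf2_powmod(x, order // q, poly, deg) != 1
               for q in prime_factors(order))


# Exponents as read from the generating polynomial in the text (assumption).
CRYPTO1_EXPONENTS = [48, 43, 39, 38, 36, 34, 33, 31, 29,
                     24, 23, 21, 19, 13, 9, 7, 6, 5, 0]
print(is_primitive(CRYPTO1_EXPONENTS))
```

Because each power is computed by repeated squaring, the whole check needs only a few hundred field multiplications even though the period itself is far too long to step through.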


The protocol between the Mifare chip and reader loosely 
follows the ISO 9798-2 specification, which describes an 
abstract challenge-response protocol for mutual authenti- 
cation. The authentication protocol takes a shared secret 
key and a unique tag ID as its inputs. At the end of the 
authentication, the parties have established a session key 
for the stream cipher and both parties are convinced that 
the other party knows the secret key. 


3 Cipher Vulnerabilities 


The 48-bit key used in Mifare cards makes brute-force
key searches feasible. Attacks cheaper than brute force,
however, are possible because of the cipher’s weak
cryptographic structure. While the vulnerability to
brute-force attacks already makes the cipher weak, the
cheaper attacks are relevant for many Mifare deployments
such as fare collection where the value of breaking a
particular key is relatively low. Weaknesses of the random
number generator and the cryptographic protocol allow
an attacker to pre-compute a codebook and perform key
lookups quickly and cheaply using rainbow tables.


3.1 Brute-Force Attack 


In a brute-force attack an attacker records two
challenge-response exchanges between the legitimate reader
and a card and then tests every possible key to see whether
it produces the same result.


To estimate the expected time for a brute-force attack,
we implemented the cipher on FPGA devices by Pico
Computing. Due to the simplicity of the cipher, 6
fully-pipelined instances can be squeezed into a single
Xilinx Virtex-5 LX50 FPGA. Running the implementation on
an array of 64 such FPGAs to try all 2^{48} keys takes under
50 minutes.
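
As a rough plausibility check on these figures, the short calculation
below derives the key-testing rate each cipher instance would need.
It assumes the work is split evenly across all instances and that
the search runs for the full 50 minutes; the paper does not state
the clock rate, so the result is only an implied lower bound.

```python
# Sketch: implied per-instance key rate for the FPGA brute-force search.
KEYSPACE = 2 ** 48           # number of candidate keys
FPGAS = 64                   # size of the FPGA array (from the text)
INSTANCES_PER_FPGA = 6       # pipelined cipher instances per FPGA (from the text)
SEARCH_SECONDS = 50 * 60     # upper bound on search time (from the text)

def required_rate_per_instance(keyspace, instances, seconds):
    """Keys per second each instance must test to cover the keyspace."""
    return keyspace / instances / seconds

rate = required_rate_per_instance(
    KEYSPACE, FPGAS * INSTANCES_PER_FPGA, SEARCH_SECONDS)
print(f"about {rate / 1e6:.0f} million keys per second per instance")
```

A rate in the low hundreds of millions of keys per second is
consistent with one key per clock cycle on a pipelined design.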


3.2 Random Number Generation 


The random number generator (RNG) used on the Mi- 
fare Classic tags is highly insecure for cryptographic ap- 
plications and further decreases the attack complexity by 
allowing an attacker to pre-compute a codebook. 

The random numbers on Mifare Classic tags are gener- 
ated using a linear feedback shift register with constant 
initial condition. Each random value, therefore, only de- 
pends on the number of clock cycles elapsed between the 
time the tag is powered up (and the register starts shift- 
ing) and the time the random number is extracted. The 
numbers are generated using a maximum-length 16-bit
LFSR of the form:

The register is clocked at 106 kHz and wraps around every
0.6 seconds after generating all 65,535 possible output
values. Aside from the highly insufficient length of
the random numbers, an attacker that controls the tim- 
ing of the protocol controls the generated number. The 
weakness of the RNG is amplified by the fact that the 
generating LFSR is reset to a known state every time 
the tag starts operating. This reset is completely un- 
necessary, involves hardware overhead, and destroys the 
randomness that previous transactions and unpredictable 
noise left in the register. 
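
The dependence of the "random" value on elapsed clock cycles alone
can be illustrated with a small model. The LFSR equation is missing
from this copy of the text, so the sketch assumes the feedback
polynomial x^16 + x^14 + x^13 + x^11 + 1, which is the one commonly
reported for the Mifare Classic tag generator in public analyses;
the reset value and function names are hypothetical.

```python
# Sketch: a 16-bit LFSR with a constant reset state is fully
# determined by the number of clock cycles since power-up.
CLOCK_HZ = 106_000        # register clock from the text
RESET_STATE = 0x0001      # hypothetical constant power-up value

def step(state):
    """One shift of the assumed LFSR (taps at bits 0, 2, 3 and 5)."""
    feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (feedback << 15)

def value_after(cycles, state=RESET_STATE):
    """Register contents a given number of cycles after reset."""
    for _ in range(cycles):
        state = step(state)
    return state

def period(state=RESET_STATE):
    """Number of shifts until the register returns to its start value."""
    start, count = state, 0
    while True:
        state = step(state)
        count += 1
        if state == start:
            return count

cycles_per_wrap = period()
print(cycles_per_wrap, round(cycles_per_wrap / CLOCK_HZ, 2), "seconds")
# Querying at the same delay after power-up yields the same value.
print(value_after(1234) == value_after(1234))
```

An attacker who fixes the delay between powering the tag and
requesting the nonce therefore fixes the nonce.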


We were able to control the number the Mifare random 
number circuit generated using the OpenPCD reader 
and custom-built firmware. In particular, we were able 
to generate the same “random” nonce in each query, 
thereby completely eliminating the tag randomness from 
the authentication process. Moreover, we found the same 
weakness in the 32-bit random numbers generated by the 
reader chip, which suggests that a similar hardware im- 
plementation is used in the chip and reader. Here, too, 
we were able to repeatedly generate the same number. 
While in our experiments this meant controlling the tim- 
ing of the reader chip, a skilled attacker will likely be 
able to exploit this vulnerability even in realistic scenar- 
ios where no such control over the reader is given. The 
attacker can predict forthcoming numbers from the
numbers already seen and precisely choose the time to start
interacting with the reader in order to receive a certain
challenge. The lack of true randomness on both reader
and tag enables an attacker to eliminate any form of
randomness from the authentication protocol. Depending
on the number of precomputed codebooks, this process 
might take several hours and the attack might not be fea- 
sible against all reader chips. 


3.3 Pre-Computing Keys


Several weaknesses of the Mifare card design add up to 
what amounts to a full codebook pre-computation. First, 
the key space is small enough for all possible keys to be 
included. Second, the random numbers are controllable. 
In addition, the secret key and the tag ID are combined 
in such a way that for each session key there exists ex- 
actly one key for each ID that would result in that session 
key. The key and the ID are shifted into the register se- 
quentially, but no non-linearity is mixed in during this 
process. As explained in Section 2.2, for every delta of 
ID bits, there exists a delta of key bits that corrects for 
the difference and results in the same session key. Therefore,
given a key that for some ID results in a session
key, there exists a key for any ID that would result in the 
same session key. This bijective mapping allows for a 
codebook that was pre-computed for only a single ID to 
be used to find keys for all other IDs as well. 
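
The key and ID compensation argument can be made concrete with a toy
model. The sketch below is not Crypto-1: it uses an 8-bit register
with hypothetical taps and function names, and only mirrors the
property stated in the text, namely that key and ID are shifted in
sequentially with purely linear feedback. Under that assumption it
checks by exhaustive search that exactly one key reproduces a given
session state for a different ID, and that the required key
difference depends only on the ID difference.

```python
# Sketch: linear loading of key and ID makes key deltas cancel ID deltas.
N = 8                   # toy register width; Crypto-1 uses 48 bits
TAPS = 0b00011101       # hypothetical taps; bit 0 set keeps each shift invertible

def parity(value):
    return bin(value).count("1") & 1

def shift_in(state, bits, nbits):
    """Shift nbits input bits into the register, XORed with linear feedback."""
    for i in range(nbits):
        feedback = parity(state & TAPS) ^ ((bits >> i) & 1)
        state = (state >> 1) | (feedback << (N - 1))
    return state

def session_state(key, tag_id):
    """Load the key, then the ID, with no non-linear mixing."""
    return shift_in(shift_in(0, key, N), tag_id, N)

def keys_for(target, tag_id):
    """All keys that produce the target session state for this ID."""
    return [k for k in range(1 << N) if session_state(k, tag_id) == target]

key_a, id_a, id_b = 0xA7, 0x3C, 0x5A
target = session_state(key_a, id_a)
matches = keys_for(target, id_b)
key_delta = matches[0] ^ key_a       # key change that corrects for id_a -> id_b

# The same key delta works for every other key as well.
delta_is_universal = all(
    session_state(k, id_a) == session_state(k ^ key_delta, id_b)
    for k in range(1 << N))
print(len(matches), delta_is_universal)
```

Because the correction is a fixed XOR per ID difference, a codebook
built for one ID can be translated to any other ID.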


A codebook for all keys would occupy 1500 Terabytes, 
but can be stored more economically in rainbow tables. 
Rainbow tables store just enough information from a key 
space for finding any key with high probability, but re- 
quire much less space than a table for all keys [9, 15]. 
Each “rainbow” in these tables is the repeated application 
of slight variants of a cryptographic operation. In our 
case, we start with a random key and generate the output 
of the authentication protocol for this key, then use this 
output as the next key for the authentication, generate its 
output, use that as the next key, and so on. We then only 
store the first and last value of each rainbow, but compute 
enough rainbows so that almost all keys appear in one of 
them. To find a key from such a rainbow table, a new 
rainbow is computed starting at a recorded output from 
the authentication protocol. If any one of the generated 
values in this series is also found in the stored end values 
of the rainbows, then the key used in the authentication 
protocol can be found from the corresponding start val- 
ues of that matching rainbow. The time needed to find a 
key grows as the size of the tables shrinks. 
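
The chain construction and lookup described above can be sketched on
a toy key space. Everything below is illustrative: a truncated
SHA-256 stands in for the authentication protocol, the key space is
16 bits, and the names (auth_output, reduce_to_key, build_table,
lookup) and table sizes are assumptions chosen for this example.

```python
# Sketch: toy rainbow table that stores only chain start and end values.
import hashlib

KEY_BITS = 16        # toy key space; Mifare keys are 48 bits
CHAIN_LEN = 64       # links per chain
N_CHAINS = 2048      # number of chains stored

def auth_output(key):
    """Stand-in one-way function (hypothetical; not Crypto-1)."""
    digest = hashlib.sha256(key.to_bytes(4, "big")).digest()
    return int.from_bytes(digest[:4], "big")

def reduce_to_key(output, column):
    """Turn an output into the next key; varies slightly per column."""
    return (output + column) % (1 << KEY_BITS)

def build_table():
    """Map each chain's end value to the start values that reach it."""
    table = {}
    for start in range(N_CHAINS):
        key = start
        for column in range(CHAIN_LEN):
            key = reduce_to_key(auth_output(key), column)
        table.setdefault(key, []).append(start)
    return table

def lookup(table, target):
    """Return a key whose output equals target, or None if not covered."""
    for column in range(CHAIN_LEN - 1, -1, -1):
        # Assume target was produced at this column and walk to the chain end.
        key = reduce_to_key(target, column)
        for later in range(column + 1, CHAIN_LEN):
            key = reduce_to_key(auth_output(key), later)
        for start in table.get(key, []):
            # Replay the stored chain to find the key that produced target.
            candidate = start
            for replay in range(CHAIN_LEN):
                output = auth_output(candidate)
                if output == target:
                    return candidate
                candidate = reduce_to_key(output, replay)
    return None

table = build_table()
secret = 5
for column in range(10):                 # pick a key known to lie on a chain
    secret = reduce_to_key(auth_output(secret), column)
recovered = lookup(table, auth_output(secret))
print(recovered is not None and auth_output(recovered) == auth_output(secret))
```

Shorter tables mean longer chains to replay, which is the time and
space trade-off noted in the text.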


Determining any card’s secret key will be significantly 
cheaper than trying out all possible keys even for rain- 
bow tables that only occupy a few Terabytes and can 
be almost as cheap as a database lookup. The fact that 
an attacker can use a pre-computed codebook to reveal 
the keys from many cards dramatically changes the eco- 
nomics of an attack in favor of the attacker. This means 
that even attacks on low-value cards like bus tickets 
might be profitable. 


3.4 Threat Summary 


To summarize the threat to systems that rely on Mifare 
encryption for security, we illustrate a possible attack. 
An attacker would first scan the ID from a valid card. 
This number is unprotected and always sent in the clear. 
Next, the attacker would pretend to be that card to a legit- 
imate reader, record the reader message of the challenge- 
response protocol with controlled random nonces, and 
abort the transaction. Given only two of these messages, 
the key of the card can be found in the pre-computed 
rainbow tables in a matter of minutes and then used to 
read the data from the card. This gives the attacker all 
the information needed to clone the card.


4 Discussion


The illustrated attack is yet another example of security- 
by-obscurity failing. Weaknesses in the exposed cipher 
reveal the pitfalls of proprietary cipher design without 
peer-review. A few changes in the design would have 
made some of the discussed attacks infeasible and could 
have increased the key size within the same hardware 
constraints to make brute-force attacks less likely. Much 
better security, however, can only be achieved through 
better, more thoroughly analyzed ciphers. 


4.1 Potential Fixes 


The system is vulnerable against codebook attacks be- 
cause of its weak random numbers and the linear combi- 
nation of key and ID. Both can be fixed without adding 
extra hardware or slowing down the operation. 


Better, yet still not cryptographically sound, random 
numbers can be generated by exploiting the fact that 
memory cells are initially in an undetermined state [10]. 
The same behavior can be caused in flip-flops like those 
that make up the state register of the stream cipher simply 
by not resetting the flip-flops at initialization time. The 
cipher state would start in a random state and then evolve 
using the cipher’s feedback loop until a random number 
is needed. At this point, the register contains a mostly 
unpredictable number of the size of the state register. 


Because this design generates random numbers within 
the same registers that are used for the cipher states, it 
eliminates the need for a separate additional PRNG cir- 
cuit. The saved area could then be spent on increasing the 
size of the cipher state. In the area of the 48-bit Crypto-1 
and its 16-bit RNG, a 64-bit stream cipher that also
produces significantly better pseudo-random numbers could
hence be implemented. This increases the size and quality
of the random numbers and at the same time increases
the key size beyond the point where brute-force attacks 
can be done cheaply. 


To further improve the resistance against codebook at- 
tacks, the non-linear feedback should be combined with 
either key or ID when shifted into the register to break the 
bijective mapping between different key-ID pairs. This 
measure does not increase implementation costs, since 
we only integrate the output of the filter function which 
is already computed. 


To improve the resistance of the cipher against statistical
attacks, the update function must be made non-linear,
either by feeding some intermediate result of the filter
function into the linear register or by using a non-linear
feedback shift register instead.


None of the possible fixes will make the cipher appropri- 
ate for high security applications, but they improve the 
resistance against the most concerning attacks and can 
be done without any additional implementation cost. 


4.2 Possible Defenses 


Possible ways to protect against the described attacks 
include using standard, peer-reviewed, established cryp- 
tography such as the 3-DES block cipher that is already 
found on some of the more expensive cards including 
some of the Mifare line of products. A cheaper alterna- 
tive that can be implemented in about twice the size of 
Crypto-1 is the Tiny Encryption Algorithm (TEA) [12, 
18]. This established low-cost block cipher has pub- 
licly been scrutinized for several years and is so far only 
known to be vulnerable to some expensive attacks [11]. 
While TEA is far more secure than Crypto-1, it is also 
much slower. A Mifare authentication takes little more 
than one millisecond, while a minimum-size implemen- 
tation of TEA would take about ten times as long. This 
would still be fast enough for most applications where 
Mifare cards are currently used. 


Other known ways to protect against card cloning in- 
clude fraud detection algorithms that are widely used in 
monitoring credit card transactions. These algorithms 
detect unusual behavior and can prevent fraudulent trans- 
actions, but require storing and analyzing transaction 
data, which runs contrary to the desire for privacy in 
RFID applications. Fraud protection systems also re- 
quire all readers to be constantly connected to a central 
server, which is not the case in some of the current and 
planned deployments of RFID tags where offline readers 
are used. 


Tamper-proofing can be used to protect secret keys 
from attackers, but provides little help against hardware 
reverse-engineering because the structure of the circuits 
will always be preserved. The implementation, however, 
could be obfuscated to increase the complexity of the cir- 
cuit detection. While we believe that obfuscations will 
not make our approach infeasible, we do not yet know 
to what degree obfuscations could increase the effort and 
cost required to reverse-engineer a circuit. 


All low-cost cryptographic RFID tags are currently ill- 
suited for high security applications because they lack 
tamper-proofing and are vulnerable to relay attacks. In
these attacks, the communication between a legitimate
reader and a valid card is relayed through a tunnel 
thereby giving the reader the false impression that the 
card is in its vicinity. No level of encryption can pro- 
tect against relay attacks and new approaches such as 
distance bounding protocols are needed [8]. 


5 Conclusions 


Reverse-engineering functionality from silicon imple- 
mentations can be done cheaply, and can be automated 
to the point where even large chips are potential targets. 
This work demonstrates that the cost of finding the algo- 
rithm used in a hardware implementation is much lower 
than previously thought. Using template matching, algo- 
rithms can be recovered whose secrecy has so far pro- 
vided a base for security claims. The security of embed- 
ded cryptography, therefore, must not rely on obscurity. 
Any algorithm given to users in form of hardware can 
be disclosed even when no software implementation ex- 
ists and black-box analysis is infeasible. Once the de- 
tails of a cryptographic cipher become public, its secu- 
rity must rely entirely on good cryptographic design and 
sufficiently long secret keys. 


The cryptographic strength of any security system de- 
pends on its weakest link. Besides the cryptographic 
structure of the cipher, weaknesses can arise from pro- 
tocol flaws, weak random numbers, or side channels. 
When random numbers are weak and the user identifica- 
tion is not properly mixed into the secret state, codebooks 
can be pre-computed that lead to attacks that are much 
more efficient than brute force. In the case of the Mifare 
Classic cards, the average attack cost shrinks from sev- 
eral hours to minutes. Their cryptographic protection is 
hence insufficient even for low-valued transactions. 


The question remains open as to whether security can be 
achieved within the size of the Mifare Crypto-1 cipher. 
The area of less than 500 gates may be too small to even 
hold a sufficiently large state, regardless of the circuits 
needed for the complex operations required for strong 
ciphers.


Abstract


Graphics processors are continuing their trend of vastly
outperforming CPUs while becoming more general purpose.
The latest generation of graphics processors has
introduced the ability to handle integers natively. This has
increased the GPU’s applicability to many fields,
especially cryptography. This paper presents an
application-oriented approach to block cipher processing on GPUs.
A new block based conventional implementation of AES 
on an Nvidia G80 is shown with 4-10x speed improve- 
ments over CPU implementations and 2-4x speed in- 
crease over the previous fastest AES GPU implementa- 
tion. We outline a general purpose data structure for rep- 
resenting cryptographic client requests which is suitable 
for execution on a GPU. We explore the issues related 
to the mapping of this general structure to the GPU. Fi- 
nally we present the first analysis of the main encryption 
modes of operation on a GPU, showing the performance 
and behavioural implications of executing these modes 
under the outlined general purpose data model. Our AES 
implementation is used as the underlying block cipher to 
show the overhead of moving from an optimised hard- 
coded approach to a generalised one. 


1 Introduction 


With the introduction of the latest generation of graph- 
ics processors, which include integer and float capa- 
ble processing units, there has been intensifying inter- 
est both in industry and academia to use these devices 
for non graphical purposes. This interest comes from 
the high potential processing power and memory band- 
width that these processors offer. The gap in processing 
power between conventional CPUs and GPUs (Graph- 
ics Processing Units) is due to the CPU being optimised 
for the execution of serial processes with the inclusion 
of large caches and complex instruction sets and decode
stages. The GPU uses more of its transistor budget
on execution units rather than caching and control.
For applications that suit the GPU structure, those with 
high arithmetic intensity and parallelisability, the per- 
formance gains over conventional CPUs can be large. 
Another factor in the growth of interest in general pur- 
pose processing on GPUs is the provision of more uni- 
form programming APIs by both major graphics proces- 
sor vendors, Nvidia with CUDA (Compute Unified De- 
vice Architecture) [1] and AMD with CTM (Close To 
Metal) [2]. 


The main obstacle to achieving good performance
on a GPU is ensuring that all processing units
are busy executing instructions. This becomes a
challenge in consideration of Nvidia’s latest processor, which
contains 128 execution units, given the restrictions of 
its SPMD (Single Program Multiple Data) programming 
model and the requirement to hide memory latency with 
a large number of threads. With respect to private key 
cryptography and its practical use, a challenge exists in 
achieving high efficiency particularly when processing 
modes of operation that are serial in nature. Another 
practical consideration is the current development over- 
head associated with using a GPU for cryptographic ac- 
celeration. Client applications would benefit from the 
ability to map their general cryptographic requirements 
onto GPUs in an easy manner. 

In this paper we present a data model for encapsulating 
cryptographic functions which is suitable for use with the 
GPU. The application of this data model and the details 
of its interaction with the underlying GPU implementa- 
tions are outlined. In particular we investigate how the 
input data can be mapped to the threading model of the 
GPU for modes of operation that are serial and parallel 
in nature. We show the performance of these modes and 
use our optimised AES implementation to determine the 
overhead associated with using a flexible data model and 
its mapping to the GPU. Also included in the paper is 
a study of the issues related to the mixing of modes of 
operation within a single GPU call.

Motivation: The motivation for this research is based
on the GPU acting as a general guide for the long-term 
direction of general purpose processing. X86 architec- 
tures are bottlenecking with limited increase in clock fre- 
quency reported in recent years. This is being tackled by 
the addition of cores to a single die to provide growth in 
the total available clock cycles. The GPU is the logical 
extreme of this approach where the emphasis has always 
been on more but simpler processing elements. AMD’s
upcoming Accelerated Processing Unit (Swift) [3]
architecture is a reasonable compromise where a CPU
and GPU are combined onto a single chip. Also Intel
are developing computing solutions under the TeraScale 
banner which include a prototype of an 80 core proces- 
sor. Using GPUs as a research platform exposes the is- 
sues that general purpose processing will encounter in 
future highly parallel architectures. Another motivation 
is the use of GPUs as a cryptographic co-processor. The 
types of applications that would most likely benefit are 
those within a server environment requiring bulk cryp- 
tographic processing, such as secure backup/restore or 
high bandwidth media streaming. We also wish to show 
the implications of the inclusion of the GPU as a generic 
private key cryptographic service for general application 
use. 

Organisation: In Section 2 a brief description of the 
essentials in GPU hardware used is outlined, along with 
the CUDA programming model. Section 3 shows the 
related work in cryptography on non general purpose 
processors with a focus on GPUs. We present an
implementation of AES on Nvidia’s G80 architecture
and show its performance improvements over compa- 
rable CPU and GPU implementations in Section 4. In 
Section 5 we introduce the generic data model suited to 
GPUs, which is used to encapsulate application crypto- 
graphic requirements. Section 6 describes in detail the 
steps of mapping from the generic data structure to un- 
derlying GPU implementations. All three previous sec- 
tions are combined by the implementation of modes of 
operation using the outlined data model and the
optimised AES implementation in Section 7. This shows the
overheads associated with going from a hardcoded to a more
general purpose implementation.


2 GPU Background 


In this section we present a brief account of the GPU ar- 
chitecture used in the implementations presented within 
this paper, the Nvidia G80. We also give an outline 
of the new CUDA [1] programming model which has 
been introduced by Nvidia to provide a non graphics 
API method of programming the G80 generation of
processors. Prior to this programming interface, either
OpenGL [4] or DirectX [5] had to be used, at a
considerable learning expense to the programmer. AMD have
also introduced their own software stack, CTM [2], to
tackle the issue of providing a more user friendly
programming interface to their processors; however, we do
not cover this here. The G80 processors are DX10 [6]
standard compliant, which implies that they belong to the
first generation of GPUs which support integer data units
and bitwise operations, a key advancement relating to the
field of cryptography.

Physical View: The G80 can consist of up to 16
multiprocessors within a single chip. Each of these
multiprocessors consists of 8 ALUs (Arithmetic and Logic
Units) which are controlled by a single instruction unit
in a SIMD (Single Instruction Multiple Data) fashion.
The instruction unit only issues a single instruction to
the ALUs every four clock cycles. This creates an
effective SIMD width of 32 for each multiprocessor, i.e. a
single instruction for 32 units of data. Each
multiprocessor has limited fast on-chip memory consisting of
32 bit register memory, shared memory, constant cache and
texture cache. All other forms of memory, such as linear
memory and texture arrays, are stored in global memory,
i.e. off-chip. GPUs can be used in arrangements of
multiple chips on a single graphics card and also multiple
boards on a single motherboard. For all implementations
and comparisons with CPUs we have restricted the
arrangements used to a single GPU and a single CPU core.

Execution Model: The CUDA programming model 
provides a way to program the above chip in a relatively
straightforward manner. The programmer can
define threads which run on the G80 in parallel using 
standard instructions we are familiar with within the field 
of general purpose programming. The programmer de- 
clares the number of threads which must be run on a 
single multiprocessor by specifying a block size. The 
programmer also defines multiple blocks of threads by 
declaring a grid size. A grid of threads makes up a single 
kernel of work which can be sent to the GPU and when 
finished, in its entirety, is sent back to the host and made 
available to the CUDA application. 

Two more points of note are relevant to this paper.
First, all threads within a single block will run only
on a single multiprocessor. This allows threads within a
single block to share data with other threads within the
block via shared memory. Inter block communication is
not possible as there is no synchronisation method
provided for this. Second, due to the 32 SIMD wide
execution arrangement described above, Nvidia have
introduced the notion of a warp. A warp represents a
grouping of threads into a unit of 32. These threads run
on the same multiprocessor for the entire 4 cycles
required to execute a single instruction. Threads are
assigned to a warp in a simple serially increasing order
starting at 0 for the first thread within a block.
Performance issues can arise when a group of 32 threads
diverge in their code path; this causes the entire 4 cycles
to be run for every unique instruction required by the 32
threads.
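
The cost of divergence can be illustrated with a small model. It is
a simplification built only from the description above: threads are
grouped into warps of 32 by index, and each warp spends 4 cycles per
distinct code path taken by its threads for a given instruction
step. The function names are hypothetical and the model ignores all
other scheduling effects.

```python
# Sketch: issue cycles for one instruction step under warp divergence.
WARP_SIZE = 32
CYCLES_PER_INSTRUCTION = 4

def warp_of(thread_idx):
    """Warp number for a thread, assigned in serially increasing order."""
    return thread_idx // WARP_SIZE

def issue_cycles(paths):
    """paths[i] labels the code path thread i of a block takes."""
    paths_per_warp = {}
    for thread_idx, path in enumerate(paths):
        paths_per_warp.setdefault(warp_of(thread_idx), set()).add(path)
    return sum(CYCLES_PER_INSTRUCTION * len(distinct)
               for distinct in paths_per_warp.values())

uniform = ["a"] * 64                              # no divergence
aligned = ["a"] * 32 + ["b"] * 32                 # diverges only between warps
interleaved = ["a" if i % 2 == 0 else "b" for i in range(64)]  # within warps
print(issue_cycles(uniform), issue_cycles(aligned), issue_cycles(interleaved))
```

Branches that split along warp boundaries cost nothing extra in this
model, while branches that split inside a warp double its cycles.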


3 Related Work 


A variety of non general purpose processors has been 
used in the implementation of private key ciphers over 
the years. Specifically within the field of graphics pro- 
cessors, the first implementation of any cipher was by 
Cook et al. [7]. They implemented AES on an Nvidia 
Geforce3 Ti200, which had little programmable
functionality. Their implementation was restricted to using
the OpenGL library and only a fixed function graphics
pipeline. They describe the use of configurable color
maps to support byte transforms and the use of the fi- 
nal output stage of the pipeline (Raster Operations Unit 
(ROP)) to perform XORs. Unfortunately due to the re- 
strictive nature of the hardware used and having to per- 
form all XORs in the final output stage of the pipeline, 
multiple passes of the pipeline were required for each 
block. The authors presented a successful full implemen- 
tation running within the range of 184 Kbps - 1.53 Mbps. 

Harrison et al. [8] presented the first CPU competitive 
implementation of a block cipher on a GPU. They used 
the latest DX9 compliant generation of graphics proces- 
sor to implement AES, namely an Nvidia 7900GT. These 
processors support a more flexible programming model 
compared to previous models, whereby certain stages of 
the graphics pipeline can execute C like programmer
defined threads. However, the 7900GT only supports
floating point operations. Three different approaches were
investigated to overcome the lack of integer bitwise
operations on the programmable portion of the pipeline.
The XOR operation was simulated using lookup tables for
4 bit and 8 bit XORs, and also using the hardware
supported XOR function within the final stage (ROP) of the
pipeline. Their results showed that a multipass
implementation using the built in XOR function, combined
with a technique called ping-ponging of texture memory to
avoid excess data transfers across the PCIe bus, could be
used to achieve a rate of 870 Mbps.

More recently the latest generation of hardware, which 
supports integer data types and bitwise operations, has 
been used by Yang et al. [9] to produce much improved 
performance results. This paper focuses on a bitslicing 
implementation of DES and AES which takes advantage 
of the AMD HD 2900 XT GPU’s large register size. The 
GPU is used as a 4 way 32 bit processor which operates 
on four columns of 32 bitsliced AES state arrays in paral- 
lel. They show rates of 18.5 Gbps processing throughput 
for this bitsliced AES implementation. A bitsliced
implementation isn’t suitable for general purpose use as it
requires heavy preprocessing of the input blocks. The
authors [9] argue that their bitslicing approach can be
put to use as a component in a template-based key
searching utility or for finding missing key bytes in side
channel attacks whereby the input state is static relative
to the key. A conventional block based implementation of
AES is also presented in this paper, running at rates of
3.5 Gbps. Whether this includes transfers of input/output
blocks across the PCIe bus is not indicated.

Other non general purpose processors used for private 
key cryptography include various ASIC designs such as 
[11] [12] [13] and custom FPGA efforts [14] [15]. GPUs 
have also been applied in the field of public key cryptog- 
raphy. A paper by A. Moss et al. [10] tackles the problem 
of executing modular exponentiation on an Nvidia 7800 
GTX. They present promising results showing a speed 
up of 3 times when compared to a similar exponentia- 
tion implementation on a standard X86 processor. Also 
related to graphics processors, Costigan and Scott [16] 
presented an implementation of RSA using the PlayStation 3’s
IBM Cell processor. They were able to increase
the performance of RSA using the Cell’s 8 SPUs over its 
single PowerPC core. 


4 Block Based AES Implementation 


In this section we present an optimised implementation 
of the AES cipher in CTR mode on an Nvidia 8800 GTX
(G80) using the CUDA programming model. The aim 
of this section is to provide the performance figures and 
implementation approach which will be used in conjunc- 
tion with the data model described in Section 5. As such 
the chosen implementation in this section is an ideal, non 
general purpose, implementation which can be used as a 
source of comparison with generalised approaches. 

As previously mentioned the G80 architecture sup- 
ports integer bitwise operations and 32 bit integer data 
types. These new features, which are shared by all DX10 
[6] compatible GPUs, simplify the implementation of 
AES and other block ciphers. This allows for a more 
conventional AES approach compared to implementa- 
tions on previous generations of graphics processors. We 
based our implementations around both the single 1 KB 
and 4 x 1 KB precalculated lookup tables which were 
presented in the AES specification paper [17], see
Equations 1 and 2 respectively.


e_j = k_j \oplus T_0[a_{(0,j)}] \oplus \mathrm{Rot}(T_0[a_{(1,j-c_1)}] \oplus \mathrm{Rot}(T_0[a_{(2,j-c_2)}] \oplus \mathrm{Rot}(T_0[a_{(3,j-c_3)}])))    (1)

e_j = T_0[a_{(0,j)}] \oplus T_1[a_{(1,j-c_1)}] \oplus T_2[a_{(2,j-c_2)}] \oplus T_3[a_{(3,j-c_3)}] \oplus k_j    (2)
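
The relationship between Equations 1 and 2 can be checked with a
short sketch. It builds the AES S-box and the T_0 table from their
standard definitions, derives T_1 to T_3 by byte rotation, and
compares the single-table form with rotations against the four-table
form. The function names are hypothetical, the inputs a0 to a3 stand
for the four state bytes already selected by the column offsets, and
the code is a host-side illustration rather than the GPU kernel.

```python
# Sketch: one 1 KB table plus rotations equals four 1 KB tables.

def gf_mul(a, b):
    """Multiply two bytes in the AES field GF(2^8)."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        high_bit = a & 0x80
        a = (a << 1) & 0xFF
        if high_bit:
            a ^= 0x1B
        b >>= 1
    return product

def build_sbox():
    """AES S-box: field inverse followed by the affine transform."""
    sbox = []
    for a in range(256):
        inv = 0 if a == 0 else next(
            b for b in range(1, 256) if gf_mul(a, b) == 1)
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox

def rot(word):
    """Rotate a 32-bit word right by one byte."""
    return ((word >> 8) | (word << 24)) & 0xFFFFFFFF

SBOX = build_sbox()
T0 = [(gf_mul(s, 2) << 24) | (s << 16) | (s << 8) | gf_mul(s, 3)
      for s in SBOX]
T1 = [rot(w) for w in T0]
T2 = [rot(w) for w in T1]
T3 = [rot(w) for w in T2]

def column_one_table(a0, a1, a2, a3, k):
    """Equation 1: a single table and three nested rotations."""
    return k ^ T0[a0] ^ rot(T0[a1] ^ rot(T0[a2] ^ rot(T0[a3])))

def column_four_tables(a0, a1, a2, a3, k):
    """Equation 2: four tables and no rotations."""
    return T0[a0] ^ T1[a1] ^ T2[a2] ^ T3[a3] ^ k

sample = (0x32, 0x88, 0x31, 0xE0, 0x2B7E1516)
print(column_one_table(*sample) == column_four_tables(*sample))
```

The single-table form trades three rotations per column for a
quarter of the table memory, which matters when fast on-chip memory
is scarce.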
As XORs are supported in the programmable section
of the graphics pipeline, there is no need to use the
ROP XOR support, which required multiple passes of
the pipeline - one for each XOR operation. Each thread
that is created calculates its own input and output
address for a single data block and runs largely in isolation
from other threads in a single pass to generate its results.
The simple thread to I/O data mapping scheme used for 
all implementations reported in this section is as follows. 
Each thread’s index relative to the global thread environ- 
ment for a kernel execution is used as the thread’s offset 
into the input and output data buffers: 

// Global thread index: one thread per 16-byte data block.
int index = threadIdx.x + (blockIdx.x * blockDim.x);
uint4 state = pt[index];   // read this thread's input block
ct[index] = state;         // write this thread's output block


where blockDim is the number of threads within a CUDA
block, blockIdx is the current CUDA block the thread
exists within and threadIdx is the current thread index
within the CUDA block. As CTR is a parallel mode
of operation, each thread works on a single AES block 
independently of other threads. To achieve high perfor- 
mance on a GPU or any highly multi-threaded processor, 
an important programming goal is to increase occupancy. 
The level of occupancy on a parallel processor indicates 
the number of threads available to the thread scheduler 
at any one time. High occupancy ensures good resource 
utilisation and also helps hide memory access latency. It 
is for occupancy reasons that we create a single thread 
for each input block of data. 

A nonce is passed to all threads through constant 
memory and the counter is calculated in the same man- 
ner as the data offsets above. Rekeying was simplified by 
using a single key for all data to be encrypted, with the 
key schedule generated on the CPU. The reason for im- 
plementing the rekeying process on the CPU rather than 
the GPU is that it is serial in nature, thus the generation 
of a key schedule must be done within a single thread. 
When processing a parallel mode of operation, it would
be an unacceptable overhead for each thread (i.e. each
data block) to generate its own schedule.
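
As an illustration of deriving the counter in the same manner as the data offsets, the following C sketch emulates on the host what each thread would compute. The 16-byte layout (an 8-byte nonce followed by a 64-bit big-endian counter) and the names `global_index` and `ctr_input_block` are assumptions made for the example; the text does not specify the layout used.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Global thread index, as in the kernel fragment above. */
uint32_t global_index(uint32_t thread_idx, uint32_t block_idx,
                      uint32_t block_dim)
{
    return thread_idx + block_idx * block_dim;
}

/* Build the 16-byte CTR input block for one thread.
   Assumed layout: 8-byte nonce, then 64-bit big-endian counter. */
void ctr_input_block(const uint8_t nonce[8], uint64_t counter,
                     uint8_t out[16])
{
    assert(nonce != NULL && out != NULL);
    memcpy(out, nonce, 8);
    for (int i = 0; i < 8; i++)
        out[8 + i] = (uint8_t)(counter >> (56 - 8 * i));
}
```

Each thread would encrypt its own input block and XOR the result with the plaintext block at the same index, so no state is shared between threads.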

Host and Device Memory: We investigated using 
both textures and linear global memory to store the input 
and output data on the device. Through experimentation 
we found global memory to be slightly faster than tex- 
ture memory for input data reads and writes, thus all our 
implementation results are based on using linear global 
memory reads and writes for plaintext and ciphertext 
data. Regarding host memory (CPU side), an important 
factor in performance of transferring data to and from 
the GPU is whether one uses page locked memory or not. 
Page locked memory is substantially faster than non page 
locked memory as it can be used directly by the GPU via 
DMA (Direct Memory Access). The disadvantage is that
systems have a limited amount of page locked memory as
it cannot be swapped, though this is normally seen as an
essential feature for secure applications to avoid paging
sensitive information to disk.

                   Coherent Reads   Random Reads
Shared Memory      0.204319s        0.433328s
Constant Memory    0.176087s        0.960423s
Texture Memory     0.702573s        1.237914s

Table 1: On-chip Memory Reads: Average execution
times of 5 billion 32-bit reads.

On-chip Memory: As the main performance bottle- 
neck for a table lookup based AES approach is the speed 
of access to the lookup tables, we implemented both 
lookup table versions using all available types of on-chip 
memory for the G80. The types used are texture cache, 
constant cache and shared memory. Shared memory is 
shared between threads in a CUDA block and is lim- 
ited to 16KB of memory per multiprocessor. It should 
be noted that shared memory is divided into 16 banks, 
where memory locations are striped across the banks in 
units of 32 bits. 16 parallel reads are supported if no bank 
conflicts occur and for those that do occur, they must be 
resolved serially. The constant memory cache working 
set is 8 KB per multiprocessor and single ported, thus it 
only supports a single memory request at one time. Tex- 
ture memory cache is used for all texture reads and is 
8 KB in size per multiprocessor. To investigate the
read performance characteristics of these types of
memory we devised read tests to access the three types of
memory in two different ways. We split the tests into
random and coherent read memory access patterns, each
test accessing 5 billion integers per kernel execution.
Coherent access patterns were included as there are
opportunities to exploit coherent reads within shared
memory, i.e. reads with no bank conflicts for a half warp
of 16 threads.
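
The bank striping described above can be made concrete with a short C sketch that counts how many accesses of a half warp fall into the same bank. It is a simplified model rather than a description of the hardware: the names `bank_of` and `conflict_degree` are hypothetical, and threads reading the very same word are counted as conflicting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BANKS 16   /* shared memory banks, per the text */
#define HALF_WARP 16

/* 32-bit words are striped across the banks. */
unsigned bank_of(uint32_t word_index) { return word_index % NUM_BANKS; }

/* Largest number of threads in one half warp that hit the same
   bank, i.e. how many serialised accesses are needed. */
unsigned conflict_degree(const uint32_t word_index[HALF_WARP])
{
    unsigned hits[NUM_BANKS] = {0};
    unsigned worst = 0;
    assert(word_index != NULL);
    for (size_t t = 0; t < HALF_WARP; t++) {
        unsigned h = ++hits[bank_of(word_index[t])];
        if (h > worst)
            worst = h;
    }
    return worst;
}
```

A half warp reading 16 consecutive words gives a degree of 1 (a coherent read), while 16 reads that are each 16 words apart all land in one bank and give a degree of 16.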

In Table 1 we can see the average execution times mea- 
sured in seconds to perform the 5 billion reads. Constant 
memory performs best with regard to coherent reads, 
though as constant memory is single ported and the 
lookup tables will be accessed randomly within a warp 
it is of little use. Shared memory outperforms the other
memory types for random access reads by a large margin
due to its high number of ports. It should be noted that
both texture and constant memory can be loaded with data
before the kernel is called, however a disadvantage of
shared memory is that it must be set up once per CUDA
block. Shared memory is designed for use as an inter
thread communication memory within the one multiprocessor
and not designed for preloading of data. This is most
likely an API limitation; the setup overhead would be
reduced if a mechanism existed to set up shared memory
once per multiprocessor before kernel execution. In an
attempt to gain from the performance benefits of coherent
memory reads when using shared memory we copied a single
lookup table 16 times across all 16 banks to avoid memory
conflicts with careful memory addressing. However, it
turns out that CUDA does not allow access to the full
16 KB, even though it is advertised as such. In fact the
developer only has access to slightly less than 16 KB, as
the first 28 bytes are reserved for system use and
overwriting them by force causes system instability.
Various optimisations were attempted to avoid bank
conflicts; the fastest approach used 16 x 1 KB tables,
save the last entry. A simple check determines whether
the last lookup entry is being sought, in which case its
value is used directly instead.

               Shared Memory
Single Table   5,945 Mbps      4,123 Mbps   4,200 Mbps
Quad Table     6,914 Mbps      4,085 Mbps   4,197 Mbps

Table 2: AES CTR maximum throughput rate for different types of on-chip memory.

Figure 1: Optimised AES CTR implementation with and without data transfers.
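
One way to read the replicated table scheme is sketched below in C. It assumes the 16 copies are interleaved word by word, so that copy number `lane` lies entirely in bank `lane`, and that the held-out last entry is supplied from elsewhere (for example a register or constant). The names `replicated_index` and `lookup` are hypothetical, and the reserved bytes at the start of shared memory are not modelled.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BANKS 16
#define TABLE_ENTRIES 256

/* Word index of `entry` in the copy belonging to this thread's
   lane. Interleaving keeps copy `lane` inside bank `lane`. */
uint32_t replicated_index(uint8_t entry, unsigned thread_id)
{
    unsigned lane = thread_id % NUM_BANKS;
    return (uint32_t)entry * NUM_BANKS + lane;
}

/* Lookup with the last entry held outside shared memory, since
   the full 16 x 1 KB layout does not fit. `shared` stands in for
   the shared memory array. */
uint32_t lookup(const uint32_t *shared, uint32_t last_value,
                uint8_t entry, unsigned thread_id)
{
    assert(shared != 0);
    if (entry == TABLE_ENTRIES - 1)
        return last_value;
    return shared[replicated_index(entry, thread_id)];
}
```

Under this layout every thread of a half warp reads from its own bank whatever entry it requests, at the price of the conditional check on each lookup.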


AES: In Table 2 we can see the maximum performance
of the different AES implementations using the different
types of on-chip memory. It can be seen that the
4 x 1 KB table approach, Quad Table, using shared memory
performs the fastest. This approach requires the four
1 KB tables to be set up within shared memory for each
CUDA block of threads running. This setup cost can be
alleviated by allocating the task of a single load from
global memory into shared memory to each thread within
the block. For this reason our implementation uses 256
threads per block, giving the least amount of overhead to
perform the setup operation. The coherent shared memory
1 KB table lookup underperforms due to the extra rotates
which must be executed, the extra conditional check for
sourcing the last table entry as described above and the
additional per CUDA block memory setup costs. Previous
generations of GPUs could hide the cost of state rotates
via the use of swizzling (the ability to arbitrarily
access vector register components), however the G80 no
longer supports this feature.


Figure 1 shows the performance of AES CTR based
on the above 4 x 1 KB table lookup approach. The figure
exposes the requirement of many threads to hide memory 
read latency within the GPU. We display the throughput 
rate of the cipher with and without plaintext and cipher- 
text transfers across the PCIe bus per kernel execution. 
A maximum rate of 15,423 Mbps was recorded without 
transfers and a maximum of 6,914 Mbps was recorded 
with transfers included. We have included the rates with- 
out data transfer as we believe these to be relevant going 
forward where the possibility exists for either: sharing 
the main memory address space with the CPU, either 
in the form of a combined processor or a direct moth- 
erboard processor slot; or overlapping kernel execution 
and data transfer support on the GPU. 


In the same figure we have compared our results with
the latest reported AES implementations on CPUs and
GPUs. Matsui [20] reports speeds of 1,583 Mbps for a
conventional block based implementation of AES on
a 2.2 GHz AMD 64 processor. [18] reports an ECB
AES implementation of 1,151 Mbps on an AMD 64
2.4 GHz processor. The authors of [9] cite a speed
of 3,584 Mbps on an AMD HD 2900 XT GPU (AMD’s
DX10 compliant GPU) for their block based AES
implementation, though we do not have access to the
throughput rates as data sizes increase. We have included
the rates achieved in [8] for AES on a GeForce 7900GT,
which does provide this rate progression. With transfers
included, we see a 4x speed up over the fastest reported
CPU implementation and a 10x speed up without transfers.
Scaling up the reported CPU AES rates to the latest
available AMD core clock speed, our GPU implementation
still substantially outperforms them. When compared to
the block based AES implementation on a GPU by [9]
we can see 2x and 4x speed ups with and without data
transfers respectively.
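
The quoted speed ups follow directly from the rates given above, taking 6,914 Mbps (with transfers) and 15,423 Mbps (without transfers) for our implementation:

```latex
\begin{align*}
\text{vs. CPU [20]:}\quad & \frac{6914}{1583} \approx 4.4, & \frac{15423}{1583} &\approx 9.7 \\
\text{vs. GPU [9]:}\quad  & \frac{6914}{3584} \approx 1.9, & \frac{15423}{3584} &\approx 4.3
\end{align*}
```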


5 Payload Data Model 


In this section we introduce the generic data model which 
we use to allow the exploration of the problems involved 
in mapping a generic private key cryptographic service to 
specific GPU implementations. The aim of this section 
is to outline the data model used, its design criteria and 
the usage implications in the context of GPUs. 


5.1 The Data Model 


We use the term payload to indicate a single grouping 
of data which contains both data for processing and its 
instructions. The client application which requires cryp- 
tographic work is responsible for the creation of a pay- 
load and hand off to a runtime library which can direct 
the payload to the appropriate implementation. The data
model described is similar to the fundamental principles
of the OpenBSD Cryptographic Framework [19] and as
such the implementations presented could potentially be
integrated into such a runtime environment.

One of the main criteria for a data model in this
context is to allow the buffering of as many messages as
possible that require processing into a single stream,
permitting the GPU to reach its full performance potential.
Exposing a payload structure to the user rather than a
per message API allows the grouping of multiple messages.
Also, the pressure for increase in data size can be
met by providing the ability for different cryptographic
functions to be combined into a single payload. Another
design criterion is the use of offsets into the data for
processing rather than pointers, as the data will be
transferred outside of the client application’s memory
address space, rendering pointers invalid. The data model
must also allow the client to describe the underlying data
and keys, with key reuse, in a straightforward manner.

In the following pseudocode we can see the key data
structures used. The main 'payload' structure contains
separate pointers for data and keys as these are normally
maintained separately. A single payload descriptor
structure is also referenced which is used to describe the
mapping of messages to the data and key streams. The
payload descriptor uses an ID to uniquely identify payloads
in an asynchronous runtime environment. A high level
mode, indicating which cryptographic service is required,
can be described within the payload descriptor or within
the individual messages. The need for a higher level mode
is due to the requirement of frameworks which abstract
from multiple hardware devices having to select a suitable
hardware configuration to implement the entire payload.
A lower level property can also be used to describe the
cryptographic mode on a per message basis, as can be
seen in the 'msgDscr' structure. The message descriptor
also provides pointers for arrays of messages, IVs
(Initialisation Vectors), ADs (Associated Data), tags, etc.
Each of these elements uses the generic element descriptor,
which allows the description of any data unit within
the data and key stream using address independent offsets.
The element descriptor separates the concept of element
size and count as the size of elements can sometimes
indicate a functional difference in the used cipher.
The return payload is similar to the payload structure,
though without the keys.

struct payload {
    unsigned char *data;            /* data stream */
    unsigned char *keys;            /* key stream */
    struct payloadDscr *dscr; };


struct payloadDscr {
    unsigned int id;                /* identifies the payload */
    struct keyValue *payloadMode;   /* high level mode */
    unsigned int msgcount;
    unsigned int size;
    struct msgDscr *msgs;
    struct elementDscr *keys; };


struct msgDscr {
    struct elementDscr *msg;
    struct elementDscr *iv;
    struct elementDscr *ad;
    struct elementDscr *tag;
    struct keyValue **msgMode; };   /* per message mode */


struct elementDscr {
    unsigned int count;
    unsigned int offset;            /* address independent */
    unsigned int size; };
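
To illustrate why offsets rather than pointers are used, the following C sketch resolves an element descriptor against whichever copy of the stream is at hand. It assumes that the elements described by one descriptor are stored back to back, so that element i starts at offset + i * size; this layout and the helper name `element_at` are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

struct elementDscr {
    unsigned int count;   /* number of elements */
    unsigned int offset;  /* byte offset of the first element */
    unsigned int size;    /* size of one element in bytes */
};

/* Resolve element `i` to an address inside `stream`. Assumes the
   elements are contiguous: element i is at offset + i * size. */
unsigned char *element_at(unsigned char *stream,
                          const struct elementDscr *d,
                          unsigned int i)
{
    assert(stream != NULL && d != NULL && i < d->count);
    return stream + d->offset + (size_t)i * d->size;
}
```

The same descriptor remains valid after the stream has been copied to another buffer, for instance into device memory, because only the base address changes.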


5.2 General Use Implications 


A consideration regarding the use of per message
properties to indicate separate functions is that it adds
extra register pressure on SPMD architectures such as GPUs.
These processors can only execute a single kernel code
across all threads; any variation in function must be
implemented using conditional branches. This technique is
called using fat kernels, where a conditional branch
indicates a large variation in underlying code executed at
run time. On SPMD processors it is better for performance
if all messages within a payload use the same function,
which is determined before kernel execution time.

Figure 2: Serialised Streams used by each thread for Data and Key indexing.

Another concern when employing a data model for use
with an attached processor, such as a GPU, is memory
allocation for I/O buffers. For the G80 it is important to
use pinned (page locked) memory; this requires a request to
be made to the CUDA library. The CUDA library then
returns a pointer to the memory requested which can be
used within the calling process. Both the input and output
buffers should use pinned memory and should also reuse the
same buffers when possible for maximum performance.
Thus there is a need for the client to be able to request
both input and output buffers, to allow the tracking of
its allocated buffers, as the implementation cannot make
a buffer reuse decision independently. This requires the
encapsulating runtime, for example a framework such as
the OpenBSD Cryptographic Framework, to support
mapping of memory allocation requests through to the
library representing the hardware which will service the
payload.


6 Applied Data Model 


In this section we cover implementation concerns when 
bridging between the previously described general pur- 
pose data model and specific GPU cipher implementa- 
tions. In particular we focus on our implementation of 
a bridging layer, which maps the data model to our spe- 
cific cipher modes of operation presented in Section 7. 
The overhead of providing a general purpose interface 
point to a GPU implementation is the addition of ab- 
straction layers which need to be resolved within each 
kernel thread. Throughput is lost when message functions,
sizes, element types, etc. can vary within a payload.
Each thread must perform extra memory accesses,
calculations and conditional branches to dereference these
dynamic settings. These per thread calculations can be 
offset by an implementation using the CPU as a prepro- 
cessing stage which optimises a payload for thread pars- 
ing before the payload is dispatched. Naturally there is a 
balance to CPU preprocessing, as one of the reasons for 
using a GPU is to act as a co-processor which in effect 
speeds up the overall throughput of a system. 


6.1 Descriptor Serialisation 


Each element in the message descriptor requires serial- 
ising on the CPU into a form which can be used inde- 
pendently and quickly within each thread on the GPU. 
An implementation determines the message descriptor
element size during serialisation, thus given a message
ID, a thread can directly look up the corresponding
message instructions. Each serialised element contains the
message data stream offset, size, function, and whatever
other information is required specific to the
implementation. The key descriptor, which contains the
access information for the key schedules, requires the
generation of a separate key schedule stream before it can
be serialised. Both serialised descriptor streams and the
key schedule streams are transferred to the GPU and stored
within texture memory address space, which gives the
best size flexibility of the cacheable memory types.

Logical Thread Index: During message serialisation
a logical thread index stream is produced to facilitate
the efficient location of a message ID given a thread’s
ID. This stream contains a single thread for each serial
mode of operation (MOO) message and as many threads
as there are blocks for each parallel MOO message. The
entries within the logical thread index consist of only the
logical thread IDs which start a message. Figure 2 shows
an example of a logical thread index and how it relates
to the message descriptor stream. We call the stream
a logical thread index because the physical thread IDs
(those assigned by the GPU), which partially determine
the thread’s physical location within the processor, do
not necessarily map directly onto the entries within the
thread index. To support balancing of work across the
multiprocessors of the GPU we require the ability to
assign work to different threads depending on their
physical ID. Balancing work across the GPU is important for
serial MOO messages, where the number of messages
may be low and the size of messages may be high.

Figure 3: Process of mapping physical threads to message IDs.
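
The construction of the logical thread index on the CPU can be sketched in C as follows. The `msg_info` record and the function name `build_thread_index` are hypothetical stand-ins for whatever the serialisation stage has available; only the rule itself (one thread per serial MOO message, one thread per block for a parallel MOO message, with the index storing each message's first thread ID) is taken from the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct msg_info {        /* hypothetical per message summary */
    uint32_t blocks;     /* message length in cipher blocks */
    int parallel;        /* non-zero for a parallel MOO message */
};

/* Store in index[i] the first logical thread ID of message i and
   return the total number of logical threads required. */
uint32_t build_thread_index(const struct msg_info *msgs, size_t n,
                            uint32_t *index)
{
    uint32_t next = 0;
    assert(msgs != NULL && index != NULL);
    for (size_t i = 0; i < n; i++) {
        index[i] = next;
        next += msgs[i].parallel ? msgs[i].blocks : 1u;
    }
    return next;
}
```

For a payload holding a 4 block parallel message, a 100 block serial message and a 2 block parallel message, the index is {0, 4, 5} and 7 logical threads are needed.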

Rekeying: As outlined previously, the GPU is a 
highly parallel device and the key schedule generation is 
inherently serial, thus in general it makes most sense to 
implement keying on the CPU prior to payload dispatch. 
Our implementation uses the CPU with a hashtable cache 
for storing key schedules to ensure key reuse across mes- 
sages. This is not just to aid efficiency at the key sched- 
ule generation stage on the CPU but also to generate the 
smallest key schedule stream possible. This is important 
for on-chip GPU caching of the key schedules. When the 
client application is generating the key stream for pay- 
load inclusion, it is important for the same keys to use 
the same position within the stream. This allows for fast 
optimisation of key schedule caching based on key off- 
sets rather than key comparison. 
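
A key schedule cache that is indexed by key offset rather than by key contents might look like the following C sketch. It is a simplified illustration and not our implementation: the fixed capacity, the multiplicative hash and the names `sched_cache`, `cache_init` and `cache_lookup` are all assumptions, and the key expansion that would run on a miss is omitted.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_SLOTS 64          /* hypothetical capacity (2^6) */
#define EMPTY 0xFFFFFFFFu

struct sched_cache {
    uint32_t key_offset[CACHE_SLOTS]; /* position in key stream */
    uint32_t sched_slot[CACHE_SLOTS]; /* position in schedule stream */
    uint32_t used;                    /* schedules generated so far */
};

void cache_init(struct sched_cache *c)
{
    assert(c != 0);
    for (int i = 0; i < CACHE_SLOTS; i++)
        c->key_offset[i] = EMPTY;
    c->used = 0;
}

/* Return the schedule slot for the key at `offset`, assigning a
   new slot on a miss. Real code would expand the key on a miss. */
uint32_t cache_lookup(struct sched_cache *c, uint32_t offset)
{
    assert(c != 0 && offset != EMPTY && c->used < CACHE_SLOTS);
    uint32_t h = (offset * 2654435761u) >> 26;  /* top 6 bits */
    while (c->key_offset[h] != EMPTY && c->key_offset[h] != offset)
        h = (h + 1) % CACHE_SLOTS;              /* linear probing */
    if (c->key_offset[h] == EMPTY) {
        c->key_offset[h] = offset;
        c->sched_slot[h] = c->used++;
    }
    return c->sched_slot[h];
}
```

Two messages that reference the same key offset receive the same schedule slot, so the schedule is generated once and the key schedule stream stays as small as possible.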


6.2 Thread to Message Mapping 


The full process for mapping a thread to a message ID
and its underlying data is as follows; it is also
shown in simplified form in Figure 3.

1. Generation of the logical thread index for all
messages as outlined previously. This work is carried out
on the CPU.


2. Mapping of the physical (GPU assigned) thread ID
to a logical thread ID within each kernel thread. This was
implemented using two different algorithms, one with a
focus on performance and the other with a focus on client
application control for load balancing. The first approach
maps physical to logical threads in a 1 to 1 manner in
multiples of 16 x 256 threads, until the last and
potentially partially full 16 x 256 group of threads. The
last group of threads is allocated evenly across the
multiprocessors, assigning physical IDs in natural order.
The reason for using 16 x 256 threads is that the
implementations used assign a fixed 256 threads per CUDA
block in multiples of 16 blocks (i.e. the number of
multiprocessors on the 8800GTX processor). This is done to
ensure the simplest form of shared memory configuration
for lookup tables, see Section 4. This approach is fast
to execute as a single check can eliminate the case of
full thread groups where physical and logical IDs are the
same. The second approach maps each physical thread
into groups of 32 striped across each CUDA block. This
mapping is executed for every thread and thus is slightly
slower than the first approach. It however gives a more
consistent mechanism for mapping physical threads to
messages and thus is more controllable by the client
application. In the first approach it is difficult or
impossible to insert serial MOO messages so that they are
evenly spread across the available multiprocessors. We use
the second approach in our reporting of results as it is
only 0.25% slower than the first and thus the advantage
outweighs the performance hit. See Section 7.3 for the
effects of load balancing work across the GPU.


3. Search of the logical thread index with the logical
thread ID to determine the message ID for the kernel. This
step also calculates the logical thread ID offset from the
beginning of the message. Due to storing a digested form of
the logical thread IDs, in which each entry in the logical
thread index is the thread ID start of a message, the
search is implemented as a binary search. A direct lookup
table could be used for better performance, however this
would require a lookup table equal in size to the number of
logical threads. For parallel MOO messages this would be
too high an overhead in terms of data transfer and
cacheability of the index. The digest version only stores a
thread index entry per message within the payload and also
provides an easy way to calculate the thread offset from
the start thread (i.e. first block) for parallel MOO
messages.


4. Use of message ID to offset into the message de- 
scriptor stream, which is used to retrieve the input data 
offset and other message settings. 
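
Step 3 of the process above can be sketched in C as a binary search for the last index entry that does not exceed the logical thread ID. The function name `find_message` and its signature are hypothetical; the sketch assumes the index is in ascending order and that its first entry is 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* index[i] is the first logical thread ID of message i (ascending,
   index[0] == 0). Returns the ID of the message owning `thread`
   and stores the thread's offset from the start of that message
   in *offset. */
size_t find_message(const uint32_t *index, size_t n,
                    uint32_t thread, uint32_t *offset)
{
    size_t lo = 0, hi = n;   /* the answer lies in [lo, hi) */
    assert(index != NULL && offset != NULL);
    assert(n > 0 && index[0] <= thread);
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        if (index[mid] <= thread)
            lo = mid;
        else
            hi = mid;
    }
    *offset = thread - index[lo];
    return lo;
}
```

The returned message ID is then used in step 4 to index the message descriptor stream, and for a parallel MOO message the stored offset is the block number handled by the thread.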


6.3 Padding 


The client application can set padding or not for each
message within the message descriptor. As the ability
to generate a linked list of addresses for use during
DMA transfer is not supported in CUDA, it results in too
high an overhead for the CPU based serialisation process
to support pre-padding messages directly into the data
stream for sending to the attached device. The reason for
this is that the CPU would have to generate a new single
stream in contiguous memory based on the new insertions
and the original data stream. An alternative, more
efficient approach is to embed the padding instructions,
which indicate the types of padding required, into the
message descriptor stream. This requires that each
thread checks if extra padding is required and generates
the necessary extra data itself. In relation to CUDA
this extra check causes thread divergence for the single
thread that must execute the padding. However the
overhead is generally very low as the divergence only lasts
for a single cipher block across 32 threads. If a full new
block is required, as potentially in PKCS#5 for example,
then an extra block is required in the output. This is an
issue for GPUs as typically a live thread cannot allocate
its own memory. The CPU must allocate this extra
space during serialisation before the payload is sent to
the GPU.
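
The two halves of this scheme, the output size reserved by the CPU and the padded block generated by the thread, can be sketched in C as follows. The sketch assumes PKCS#5 style padding applied to a 16 byte block (the generalisation usually referred to as PKCS#7); the function names `padded_size` and `pad_last_block` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK 16

/* Output size the CPU must reserve. This style of padding always
   adds between 1 and BLOCK bytes, so a message that is already a
   whole number of blocks gains one full block. */
size_t padded_size(size_t len) { return (len / BLOCK + 1) * BLOCK; }

/* Build the final padded block, as the single thread handling the
   message tail would. `tail` holds the trailing len % BLOCK bytes. */
void pad_last_block(const uint8_t *tail, size_t tail_len,
                    uint8_t out[BLOCK])
{
    assert(out != NULL && tail_len < BLOCK);
    assert(tail != NULL || tail_len == 0);
    uint8_t pad = (uint8_t)(BLOCK - tail_len);
    if (tail_len > 0)
        memcpy(out, tail, tail_len);
    memset(out + tail_len, pad, pad);
}
```

`padded_size` is the quantity needed during serialisation, since it is known from the message length alone, while `pad_last_block` corresponds to the divergent work done by one thread per padded message.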


6.4 Payload Combining 


The bridging layer implementation can easily implement
payload combining in the scenario where payloads are
queued via the encapsulating framework. The multiple
data and key schedule streams within host memory space
can be copied into consolidated input buffers on the
attached device. During the serialisation stage, the
serialised message and key descriptors are appended and
offsets are recalculated taking into account the combined
input streams on the attached device. Similarly, processed
payloads can be read from a consolidated output buffer on
the attached device into separate host buffers.
Generally ciphers do not change the size of the plaintext
and ciphertext, padding aside, allowing efficient reuse
(directly or as copies) of the input payload descriptors.
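
The offset recalculation amounts to shifting every descriptor of a queued payload by the position at which that payload's data was placed in the consolidated buffer. A minimal C sketch, with the hypothetical helper `append_descriptors` operating on the element descriptor from Section 5.1:

```c
#include <assert.h>
#include <stddef.h>

struct elementDscr { unsigned int count, offset, size; };

/* Append `n` descriptors of one queued payload to the combined
   array, shifting each offset by `base`: the position of that
   payload's data inside the consolidated device buffer.
   Returns the new number of descriptors in `combined`. */
size_t append_descriptors(struct elementDscr *combined, size_t used,
                          const struct elementDscr *src, size_t n,
                          unsigned int base)
{
    assert(combined != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        combined[used + i] = src[i];
        combined[used + i].offset += base;
    }
    return used + n;
}
```

The caller is assumed to have sized `combined` for all queued payloads and to track `base` as the running total of the data stream sizes copied so far.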


7 Modes Of Operation 


In this section we present the implementation and results 
of symmetric key modes of operation built using the pre- 
viously described data model, bridging layer and AES 
implementation. Modes of operation determine how the 
underlying block cipher is used to implement a cryp- 
tographic system which supports messages greater than 
one block in length. We have analysed the most common
encryption modes, specifically CTR, CBC, CFB
and OFB. Using these modes on a highly multithreaded
device, the major overriding characteristic which deter- 
mines throughput is whether the mode can be imple- 
mented in parallel or must be done serially. CFB’s la- 
tency reduction is not relevant within the context of a 
payload where the entire message is sent and read back 
as a single unit. OFB and CTR allow the pregenera- 
tion of a key stream with subsequent XORing with the 
plaintext/ciphertext for its operation. This can provide 
good latency reduction whereby the execution of a pay- 
load can be split into two separate stages, one for key 
stream generation and one for XORing. However, re- 
garding an application that will gain from the use of 
a bulk cryptographic co-processor, the most important 
characteristic is throughput. We focus on the throughput 
of the two main categories of MOOs: serial MOO (CBC 
and CFB encryption and OFB), and parallel MOO (CBC 
and CFB decryption and CTR). All implementations are 
based on the optimised AES implementation presented 
in Section 4 using CUDA. Discounting block cipher per- 
formance variation, these results should provide a guide 
to the general behaviour of the investigated MOOs using 
other block ciphers on a GPU. 
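
The split between serial and parallel MOOs can be seen in the data dependencies of CBC. In the C sketch below, encryption of block i needs the ciphertext of block i-1 and so must walk the message in order, whereas decryption of block i reads only ciphertext blocks i-1 and i and can be given to its own thread. The block cipher is replaced by a toy byte-wise transform that is neither AES nor secure; it and all function names are assumptions made purely to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK 16

/* Toy stand-in for a block cipher (NOT AES, not secure). It is
   invertible, which is all this sketch requires. */
static void toy_encrypt(uint8_t b[BLOCK], uint8_t key)
{ for (int i = 0; i < BLOCK; i++) b[i] = (uint8_t)(b[i] + key); }
static void toy_decrypt(uint8_t b[BLOCK], uint8_t key)
{ for (int i = 0; i < BLOCK; i++) b[i] = (uint8_t)(b[i] - key); }

/* CBC encryption in place: block i depends on ciphertext block
   i-1, so a single thread must process the message in order. */
void cbc_encrypt(uint8_t *msg, size_t nblocks,
                 const uint8_t iv[BLOCK], uint8_t key)
{
    const uint8_t *prev = iv;
    assert(msg != NULL && iv != NULL);
    for (size_t i = 0; i < nblocks; i++) {
        uint8_t *cur = msg + i * BLOCK;
        for (int j = 0; j < BLOCK; j++) cur[j] ^= prev[j];
        toy_encrypt(cur, key);
        prev = cur;
    }
}

/* CBC decryption of block i alone: only ciphertext blocks i-1
   and i are read, so each block can have its own thread. */
void cbc_decrypt_block(const uint8_t *ct, size_t i,
                       const uint8_t iv[BLOCK], uint8_t key,
                       uint8_t out[BLOCK])
{
    const uint8_t *prev = (i == 0) ? iv : ct + (i - 1) * BLOCK;
    assert(ct != NULL && iv != NULL && out != NULL);
    for (int j = 0; j < BLOCK; j++) out[j] = ct[i * BLOCK + j];
    toy_decrypt(out, key);
    for (int j = 0; j < BLOCK; j++) out[j] ^= prev[j];
}
```

Calling `cbc_decrypt_block` for the blocks of a message in any order recovers the plaintext, which is the property that places CBC decryption among the parallel MOOs.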


7.1 Parallel MOOs 


It is easier to achieve full occupancy on a highly parallel 
processor such as a GPU when processing parallel MOO 
messages compared to serial MOO messages. Each mes- 
sage can be split into blocks and assigned its own thread, 
thus the number of threads equals the total number of 
blocks within the payload. Figure 4 shows the through- 
put rates of different message sizes used within payloads 
containing parallel MOO messages. The results shown 
are based on the CTR MOO. CBC and CFB decryption 
were also implemented, though the throughput rates did 
not vary. The number of messages indicates the number 
used within a single payload. As we can see, the greater 
the payload size the higher the performance. This is ex- 
pected due to increased resource occupancy and memory 
latency being more effectively hidden. We can also see 
that at a certain throughput rate the per message over- 
head of using a generic data model becomes the domi- 
nant overhead. As a result, increasing the payload mes- 
sage count past a certain point results in a drop in perfor- 
mance. 

All results are based on multiple executions of a single
payload with the reuse of memory buffers both for
host and on device storage. This simulates the scenario
of an application managing its own host memory
allocations as described in Section 5. Our implementations
also reuse the same key, simulating all messages being
within a single cryptographic session. We also include
in Figure 4 throughput rates for maximum rate rekeying,
whereby each message contains its own unique key.
We have highlighted the comparison of rates with and
without rekeying for payloads with a message size of 512
blocks. As expected, an increasing message count results
in an increasing overhead on total throughput.

Figure 4: Throughput rates for parallel MOO messages across varying block sizes.


The maximum throughput achieved for a parallel 
MOO under the generic data model was 5,810 Mbps. An 
important observation from these figures is that we can 
see there is an overhead associated with using the de- 
scribed generic data model for abstracting the underly- 
ing implementation details. When using large messages 
(16384 blocks) this overhead is 16%, with medium sized 
messages (512 blocks) the overhead is 22% and in the 
worst case when using small messages (16 blocks) the 
overhead is 45%. The increase in overhead as the message
count increases relative to the work done is due mainly to
the caching behaviour of the index stream descriptors used
on the small 8 KB GPU texture caches. For example,
regarding the logical thread index stream when used for a
parallel MOO payload, even though it efficiently encodes
one logical thread per message (the starting block) and
extrapolates the remaining threads for the message, each
additional message requires an extra 32 bits. We see a
consistent drop off in performance for larger message
counts as each binary search performed to map the physical
thread ID to message ID must increasingly access global
memory. There is also an increased overhead associated with
CPU preprocessing of messages and the number of steps
involved in the thread to message mapping process. Future
work involves attempts to optimise the streams used,
even though we are somewhat restricted given that the
GPU is an SPMD device and fat kernels add register
pressure and reduce occupancy. In particular executing
parallel messages in groups of data for large payloads in
small message configurations could reduce the overhead
of thread to message mapping.
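
Two of the figures above can be related by simple arithmetic. Assuming the 16% overhead is measured against the 6,914 Mbps reported in Section 4 with transfers included, and ignoring that the texture cache is shared with the descriptor and key schedule streams, we have:

```latex
\begin{align*}
\text{overhead} &= \frac{6914 - 5810}{6914} \approx 0.16, \\
\text{index entries per 8 KB cache} &= \frac{8 \times 1024\ \text{bytes}}{4\ \text{bytes per entry}} = 2048 .
\end{align*}
```

Under these assumptions a payload of more than roughly two thousand messages can no longer keep its logical thread index within a single texture cache, which is consistent with the drop off described.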


7.2 Serial MOOs 


The key to good performance with serial MOO messages 
is to include a lot of messages within the payload. Given 
a low number of messages, there will be a shortage of 
threads to maintain a high occupancy level on the GPU 
and thus performance will suffer. The serial implemen- 
tations go through the same thread to message mapping 
process as normal. The message descriptor contains the 
message size for serial messages, which is used to set the 
number of input blocks to be processed by each thread 
starting with the initialisation vector (referenced via the 
message descriptor). The input addresses start at the
message offset and increase in single blocks, treating the
message as a contiguous section of the input data stream.
This creates a memory access pattern where neighbouring
threads access memory locations separated by the
size of the messages they are processing. This access
pattern has an important impact on throughput as will be
seen.

Figure 5: Throughput rates for serial MOO messages across varying block sizes.


Figure 5 shows the performance rates for a serial MOO
using different sizes of messages. All results are based
on the CBC MOO in encrypt mode; other serial MOOs
using the same block cipher performed equivalently with
regard to bulk throughput rates. All messages within a
single payload were of the same size, see Section 7.3 for
detail on mixing sizes of messages within a payload. We
have included in the figure a CPU based implementation
of the OFB MOO from [18], as a point of comparison.
We have also included the results for a parallel MOO
for a payload with a message size of 2048 blocks from
Figure 4. The figure highlights its comparison with the
corresponding serial MOO message size. We can see the
penalty paid for a low number of serial messages within
the payload as it takes quite a number of messages before
throughput substantially increases. This is easy to
see in the comparison with the parallel MOO, which starts
at quite a reasonable throughput rate from a low message
count. We can also see that there is performance
to be gained by grouping blocks into threads, which
reduces the per message overheads discussed above; this
accounts for the higher performance of the large message
count serial payloads over parallel payloads. These serial
results cannot be compared directly with the optimised AES
implementation in Section 4 for framework overhead
calculations as the optimised AES implementation
is implicitly parallel.


A disturbing trend can be observed for larger serial
MOO message sizes: as the number of messages increases,
a performance bottleneck is hit. This may be explained
by the memory access pattern created by such
executions. Neighbouring threads within a CUDA warp
use increasingly disparate memory address locations for
their input and output data as the message size increases.
We have isolated this behaviour with a separate memory
test in which each thread performs a series of sequential
reads from global memory starting at an offset from
the previous neighbouring thread equal to the number of
sequential reads performed. Figure 6 presents these
results for different offsets and corresponding sequential
reads in increments of blocks (taken to be 16 bytes for
this test). For block counts of 128 and over, the memory
read performance drops dramatically as the number
of active threads increases. There is not enough publicly
available detail on the G80 to definitively explain this
behaviour; however, it is possibly a combination of a
level 2 cache bottleneck and a limit on the number of
separate DRAM open pages supported by the DRAM
controllers. Either could cause performance drops as
concurrent memory reads reduce their coherency.
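The access pattern of this memory test can be illustrated with a small host-side sketch. It is not the CUDA kernel used for the measurements: the function name and the `warp_size` parameter are assumptions made for illustration, and only the 16-byte block size and the offset rule are taken from the description above.

```python
BLOCK_BYTES = 16  # block size used by the memory test described above


def warp_read_addresses(step, reads_per_thread, warp_size=32):
    """Byte offsets read by each thread of one warp at a given step.

    Hypothetical helper: thread k starts at an offset of
    k * reads_per_thread blocks and then reads sequentially, so at
    step `step` it reads block k * reads_per_thread + step.
    """
    if not 0 <= step < reads_per_thread:
        raise ValueError("step must lie within the thread's sequential reads")
    return [(k * reads_per_thread + step) * BLOCK_BYTES
            for k in range(warp_size)]


if __name__ == "__main__":
    for reads in (1, 16, 128):
        addrs = warp_read_addresses(0, reads)
        stride = addrs[1] - addrs[0]
        print(reads, "blocks per thread: neighbour stride", stride, "bytes")
```

With one block per thread, neighbouring threads read adjacent 16-byte blocks; at 128 blocks per thread the same neighbours are 2048 bytes apart, which is the loss of coherency the text refers to.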


7.3 Mixed MOOs 


Here we investigate the issues involved in mixing both the
MOOs and the message sizes used within a single payload.
The same occupancy consideration applies for mixed
modes as for single modes; however, in a mixed mode
context, if a small number of serial MOO messages are
present in the payload, the presence of parallel MOO
messages can help increase occupancy. Performance
issues exist when there is an imbalance between the
number of serial MOO messages and the amount of parallel
MOO message work that can be done. Another concern when
mixing message types is the positioning of serial MOO
messages across the available multiprocessors. The ideal
scenario for work load balancing is for all serial messages
to be divided evenly across the multiprocessors.
Specifically concerning the G80 processor, there is the
extra consideration of 32 threads being active at any one
time on a single multiprocessor. This restricts the ideal
division of serial MOO messages to groups of 32. Also,
for optimal arrangement of the work to be done within
these groups, they should be ordered by message size.
This reduces the number of empty SIMD slots during the
execution of the serial MOO message groups. Parallel
MOO message positioning and message size grouping
are not a concern, as these types of messages are self
load balancing: they are broken up evenly into threads
which are load balanced by our thread to message
mapping scheme.
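The benefit of ordering serial messages by size can be illustrated with a rough model. The sketch below is not the paper's implementation: it assumes, for illustration only, that each thread in a group of 32 processes one serial message at one block per step, and that the group stays busy until its longest message finishes. The function name is hypothetical.

```python
def idle_slot_steps(sizes, width=32):
    """Count idle SIMD slot-steps under a simplified execution model.

    Assumption (not from the source): messages are taken in consecutive
    groups of `width`, one message per thread, and a group runs for as
    many steps as its longest message has blocks.
    """
    idle = 0
    for start in range(0, len(sizes), width):
        group = sizes[start:start + width]
        longest = max(group)
        # Slots whose message finished early, plus slots with no message.
        idle += sum(longest - s for s in group)
        idle += (width - len(group)) * longest
    return idle


if __name__ == "__main__":
    mixed = [16, 2048] * 32          # alternating short and long messages
    print("unordered:", idle_slot_steps(mixed))
    print("ordered by size:", idle_slot_steps(sorted(mixed)))
```

Under this model, sorting places messages of equal size in the same group, so no slot sits idle waiting for a longer neighbour.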

To allow the client application sufficient control over
the positioning of serial MOO messages on the hardware,
we have used the physical thread to message mapping
described in Section 6. This allows the client to simply
group all serial messages at the start of the payload
if possible. The striping mapping scheme used will
automatically group the messages into groups of 32 and
distribute the groups evenly across CUDA blocks, which
will be assigned evenly to the available multiprocessors
by the CUDA library. Also, the order in which the serial
messages appear in the input stream is preserved, so if
the client orders messages according to their size, they
are maintained in their optimal ordering for SIMD work
load balancing. We developed a series of tests which
allow us to demonstrate the effect of different mixing
configurations of serial MOO messages across a payload.
Figure 7 shows the throughput rates of different payload
configurations. Each payload configuration consisted of
the same messages; only the ordering of the messages
was changed. The absolute throughput rates are not
relevant, as the messages used were manufactured to fit
the test requirements and not for performance. The relative
difference between the scenarios clearly shows the
importance of correct ordering of messages when mixing
serial and parallel MOO messages within a single
payload.

All payloads used 960 512-block parallel MOO mes- 
sages, 992 32-block parallel MOO messages and 1024 
serial MOO messages with 8 variations in message size 
ranging from 16 to 2048 blocks. Here is a description of 
each of the payload configurations used in Figure 7. 

Payload 1: One serial MOO message per group of 32
threads, attempting to assign all serial MOO messages to
a single multiprocessor. This assignment is not entirely
possible, as the assignment of CUDA blocks to available
multiprocessors is beyond the control of the developer.

Payload 2: One serial MOO message per 32 threads 
spread evenly across the multiprocessors. 

Payload 3: All serial MOO messages assigned to the 
minimum number of CUDA blocks. This scenario is 
much faster than 1 and 2 as all SIMD slots within 32
threads are occupied, even though not all multiproces- 
sors are occupied with work. 

Payload 4: A random distribution of serial MOO mes- 
sages across the payload. 

Payload 5: A random distribution of serial MOO 
messages across the payload, however grouped into 32 
threads to ensure full SIMD slot usage. 

Payload 6: All serial MOO messages grouped into 32 
threads and spread evenly across all multiprocessors. 

Payload 7: Same as Payload 6 however all serial 
MOO messages appear within the payload in order of 
their message size. All other payload configurations use 
a random ordering of message sizes. 

From the results one can see that the main priorities for
client ordering of serial MOO messages within a payload
are: their grouping within the device’s SIMD width
to ensure the SIMD slots are occupied; the even spread of
serial MOO message groups across the available
multiprocessors; and the ordering of serial MOO messages
according to their size to keep similar message sizes within
the one SIMD grouping. A separate and notable concern
when mixing function types within a payload is that the
underlying implementation can suffer from increased
resource pressure. The G80, like other SPMD devices,
supports only a single code path which executes on all
threads; thus the combination of function support within a
single kernel via conditional blocks can increase register
pressure and also increase overhead for the execution of
such conditions.


8 Conclusion 


In this paper we have presented an AES implementation
using CTR mode of operation on an Nvidia G80
GPU, which when including data transfer rates shows a
4x speed up over a comparable CPU implementation and
a 2x speed up over a comparable GPU implementation.
With transfer rates across the PCIe bus not included, these
ratios increase to 10x and 4x respectively. We have also
investigated the use of the GPU for serving as a general
purpose private key cryptographic processor. The
investigation covers the details of a suitable general purpose
data structure for representing client requests and how
this data structure can be mapped to underlying GPU
implementations. Also covered are the implementation and
analysis of both major types of encryption modes of
operation, serial and parallel. The paper shows the issues
and potentially client preventable caveats when mixing
these modes of operation within a single kernel
execution.

We show that the use of a generic data structure results
in an overhead ranging from 16% to 45%. The main
reason for the drop in performance is that the descriptor
data streams become too large to fit in the small texture
working cache of the G80. This per-thread overhead
occurs most acutely within the implementation of parallel
MOO payloads with small messages and a high message
count. It could be argued that in such a case a client
would be better off implementing a hardcoded approach if
the input data structures are known in advance.

Overall we can see that the GPU is suitable for bulk
data encryption and can also be employed in a general
manner while still maintaining its performance in many
circumstances for both parallel and serial modes of
operation messages. Even given the overheads of using a
generic data structure for the GPU, the performance is
still significantly higher than competing implementations,
assuming chip occupancy can be maintained. However,
when small payloads are used the GPU underperforms
in both the general and hardcoded implementations
due to resource underutilisation and
the transfer overheads associated with movement of data
across the PCIe bus. Further work is required in
optimising the mapping of a generic input data structure to
threads to improve the noted overheads. Also,
investigation of authenticated encryption modes of operation
should be included in a similar study.


Abstract


The Tor anonymisation network allows services, such as
web servers, to be operated under a pseudonym. In
previous work Murdoch described a novel attack to reveal
such hidden services by correlating clock skew changes
with times of increased load, and hence temperature.
Clock skew measurement suffers from two main sources
of noise: network jitter and timestamp quantisation
error. Depending on the target’s clock frequency, the
quantisation noise can be orders of magnitude larger than the
noise caused by typical network jitter. Quantisation noise
limits the previous attacks to situations where a high
frequency clock is available. It has been hypothesised
that by synchronising measurements to the clock ticks,
quantisation noise can be reduced. We show how such
synchronisation can be achieved and maintained, despite
network jitter. Our experiments show that synchronised
sampling significantly reduces the quantisation error and
that the remaining noise depends only on the network
jitter (but not on the clock frequency). Our improved skew
estimates are up to two orders of magnitude more accurate
for low-resolution timestamps and up to one order of
magnitude more accurate for high-resolution timestamps,
when compared to previous random sampling techniques.
The improved accuracy not only allows previous attacks
to be executed faster and with less network traffic but
also opens the door to previously infeasible attacks on
low-resolution clocks, including measuring the skew of an
HTTP server over the anonymous channel.


1 Introduction 


The Tor [1] hidden service facility allows pseudonymous 
service provision, protecting the owners’ identity and 
also resisting selective denial of service attacks. High- 
profile examples where this feature would have been 
valuable include blogs whose authors are at risk of legal 
attack [2]. Other Tor hidden websites host suppressed 
documents, permit the submission of leaked material,
and distribute software written under a pseudonym. 

As Tor is an overlay network, servers hosting hidden 
services are accessible both directly and over the anony- 
mous channel. Traffic patterns through one channel have 
observable effects on the other, thus allowing a service’s 
pseudonymous identity and IP address to be linked. Mur- 
doch [3] described an attack to reveal hidden services 
based on remote clock skew measurement. 

Here, the attacker induces a load pattern on the vic- 
tim by frequently accessing the hidden service via the 
anonymisation network or staying silent. The load 
changes will cause temperature changes of the victim, 
which in turn induces deviation of the victim’s clock 
from the true time — clock skew. At the same time, the 
attacker measures the clock skew of a set of candidate 
hosts. Viewing induced clock skew as a covert channel 
the attacker can send a pseudorandom bit sequence to 
the hidden service and see if it can be recovered from the 
clock skew measurement of all candidates. 

The attacker can measure the target’s clock skew by 
obtaining timestamps from the target’s clock and com- 
paring these timestamps against the local clock. In pre- 
vious research, the clock skew was remotely measured 
by random sampling of timestamps from the clock. This 
measurement suffers from two sources of noise: varia- 
tions in packet delay (jitter) and timestamp quantisation. 
Network jitter is often small and skewed towards zero, 
even on long-distance paths, if there is no congestion. 
Quantisation noise depends on the frequency of the tar- 
get’s clock. Depending on the source of available times- 
tamps, the quantisation noise can be significantly larger 
than the noise introduced by typical network jitter in the 
Internet. 

To minimise the quantisation error, Murdoch proposed 
to use synchronised sampling instead of random sam- 
pling. Here, the attacker synchronises the timestamp re- 
quests with the target’s clock ticks, attempting to obtain 
timestamps immediately after the clock tick, where the 
quantisation error is smallest. This approach has the
potential to reduce the quantisation noise to a small margin,
independent of the clock frequency.

Synchronised sampling improves the accuracy of 
clock-skew estimation, especially for low-resolution 
timestamps, such as the 1 Hz timestamp of the HTTP 
protocol. It not only improves the attack proposed by 
Murdoch, but also opens the door for new clock-skew 
based attacks which were previously infeasible. Further- 
more, our technique could be used to improve the iden- 
tification of hosts based on their clock skew as proposed 
by Kohno et al. [4] if active measurement is possible. 

In this paper we propose an algorithm for synchro- 
nised sampling and evaluate it in different scenarios. We 
show that synchronisation can be achieved, and main- 
tained despite network jitter, for different timestamp 
sources. Our evaluation results demonstrate that syn- 
chronised sampling significantly reduces the quantisa- 
tion error by up to two orders of magnitude. The greatest 
improvement is achieved for low-frequency timestamps 
over low network jitter paths. 

The paper is organised as follows. Section 2 intro- 
duces the concept of hidden services and describes the 
threat model and current attacks. Section 3 provides nec- 
essary background about remote clock skew estimation. 
Section 4 describes new attacks possible using synchro- 
nised sampling and explains how HTTP timestamps are 
used for clock skew estimation. Section 5 describes our 
proposed synchronised sampling technique. In Section 
6 we show the improvements of synchronised sampling 
over random sampling in a number of different scenarios. 
Section 7 concludes and outlines future work. 


2 Revealing Hidden Services 


In this paper we focus on the Tor network [1], the lat- 
est generation from the Onion Router Project [5]. Tor 
is a popular, deployed system, suitable for experimenta- 
tion. As of January 2008 there are about 2500 active Tor 
servers. Our results should also be applicable to other 
low-latency hidden service designs. 


2.1 Threat Model 


We will assume that the attacker’s goal is to link the 
hidden service pseudonym to the identity of its opera- 
tor (which in practice can be derived from the server IP 
address). The attacks we present here do not require con- 
trol of any Tor node. However, we do assume that our 
attacker can access hidden services, which means she is 
running a client connected to a Tor network. 

We also assume that our attacker has a reasonably limited
number of candidate hosts for the hidden service
(say, a few hundred). To mask traffic associated with
hidden services, many of their hosts are also publicly
advertised Tor nodes, so this scenario is plausible. All of our
attack scenarios, with one notable exception, require that
the attacker can access the candidate hosts directly (via
their IP address). To obtain timestamps, we assume the
attacker is able to directly access either the hidden
service, or another application running on the target. Again,
since many hidden servers are also Tor nodes, it is
plausible that at least the Tor application is accessible.

Our attacker cannot observe, inject, delete or modify 
any network traffic, other than that to or from her own 
computer. 


2.2 Existing Attacks 


Øverlier and Syverson [6] showed that a hidden service
could be rapidly located because of the fact that a Tor 
hidden server selects nodes at random to build connec- 
tions. The attacker repeatedly connects to the hidden 
service, and eventually a node she controls will be the 
one closest to the hidden server. By correlating input and 
output traffic, the attacker can discover the server IP ad- 
dress. 

Murdoch and Danezis [7] presented an attack where 
the target visits an attacker controlled website, which in- 
duces traffic patterns on the circuit protecting the client. 
Simultaneously, the attacker probes the latency of all the 
publicly listed Tor nodes and looks for correlations be- 
tween the induced pattern and observed latencies. When 
there is a match, the attacker knows that the node is on 
the target circuit, and so she can reconstruct the path, al- 
though not discover the end node. 

Murdoch [3] proposed the most recent attack. The at- 
tacker induces an on/off load pattern on the target by fre- 
quently accessing the hidden service via the anonymisa- 
tion network during on-periods, and staying silent during 
off-periods. At the same time the attacker measures the 
clock skew changes of the set of candidate hosts. The 
induced load changes will cause temperature changes 
on the target, which in turn cause clock skew changes. 
Viewing the load inducement as a covert channel, the
attacker can send a pseudorandom bit sequence and
compare it with the bit sequences recovered from all
candidates through the clock skew measurements. Increasing
the duration of the attack increases the accuracy to arbi- 
trary levels. 
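The comparison step of this attack can be sketched as follows. The source does not specify how the sent and recovered bit sequences are compared, so the simple bitwise agreement score used here is an assumption, as are all function names and the host labels in the usage example.

```python
import random


def pseudorandom_bits(n, seed):
    """Generate the on/off load pattern as a reproducible bit sequence."""
    rng = random.Random(seed)
    return [rng.randint(0, 1) for _ in range(n)]


def agreement(sent, recovered):
    """Fraction of positions where the recovered bit equals the sent bit.

    Assumption: plain bitwise agreement stands in for whatever
    correlation measure an attacker would actually use.
    """
    if len(sent) != len(recovered):
        raise ValueError("sequences must have the same length")
    return sum(a == b for a, b in zip(sent, recovered)) / len(sent)


def best_candidate(sent, recovered_by_host):
    """Return the candidate whose recovered bits best match the sent bits."""
    return max(recovered_by_host,
               key=lambda host: agreement(sent, recovered_by_host[host]))


if __name__ == "__main__":
    sent = pseudorandom_bits(64, seed=1)
    recovered = {
        "candidate-a": list(sent),                  # hypothetical perfect recovery
        "candidate-b": pseudorandom_bits(64, 2),    # unrelated host
    }
    print(best_candidate(sent, recovered))
```

A longer bit sequence makes a chance match with an unrelated host less likely, which corresponds to the observation that a longer attack increases accuracy.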


3 Clock-Skew Estimation 


All networked devices, such as end hosts, routers, and 
proxies, have clocks constructed from hardware and soft- 
ware components. A clock consists of a crystal oscilla- 
tor that ticks at a nominal frequency and a counter that 
counts the number of ticks. The actual frequency of a 
device’s clock depends on the environment, such as the 
temperature and humidity, as well as the type of crystal. 
It is not possible to directly measure a remote target’s
true clock skew. However, an attacker can measure the
offset between the target’s clock and a local clock, and
then estimate the relative clock skew. For a packet i
including a timestamp of the target’s clock received by the
measurer, the offset $\tilde{o}_i$ is [3]:

$$\tilde{o}_i = \tilde{t}_i - t_{r_i} = s_c t_{r_i} + \int_0^{t_{r_i}} s(t)\,dt - c_i/h - d_i \qquad (1)$$

where $\tilde{t}_i$ is the estimated target timestamp (including
quantisation error), $t_{r_i}$ is the (local) time the packet was
received, $s_c$ is the constant clock-skew component, the
integral over $s(t)$ is the variable clock skew component,
$c_i/h$ is the quantisation noise (for random sampling) and
$d_i$ is the network delay.

The constant clock skew is estimated by fitting a line
above all points $\tilde{o}_i$ while minimising the distance
between each point and the line above it using the linear
programming algorithm described in [8]. This leaves the
variable part of the clock skew and the noise. To estimate
the variable clock skew per time interval, we can use the
same linear programming algorithm for each time
window w.
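The line fit described above can be sketched without a general LP solver. The sketch below is not the algorithm from [8]; it uses the observation that minimising the summed vertical distance, subject to the line lying on or above every point, is the same as minimising the line's height at the mean x value, so the optimum lies on the upper convex hull edge that spans the mean. The function names are hypothetical.

```python
def _cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def fit_upper_line(points):
    """Return (slope, intercept) of the line on or above all points that
    minimises the summed vertical distance to them.

    `points` is a list of (receive_time, offset) pairs.
    """
    pts = sorted(set(points))
    # Upper convex hull, built left to right (monotone chain).
    hull = []
    for p in pts:
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    mean_x = sum(x for x, _ in points) / len(points)
    # The optimal line supports the hull on the edge spanning the mean x.
    for (x0, y0), (x1, y1) in zip(hull, hull[1:]):
        if x1 > x0 and x0 <= mean_x <= x1:
            slope = (y1 - y0) / (x1 - x0)
            return slope, y0 - slope * x0
    raise ValueError("need at least two points with distinct x values")


if __name__ == "__main__":
    # Synthetic offsets: true skew 0.5 per unit time, noise only downwards.
    noise = [0, 3, 1, 2, 4, 1, 2, 3, 1, 0]
    samples = [(x, 0.5 * x - n) for x, n in zip(range(10), noise)]
    print(fit_upper_line(samples))
```

Applying the same function to the samples of each window w gives the piece-wise estimate of the variable skew.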

Figure 1 shows an example of a clock skew measure- 
ment across the Internet. The target was 22 hops away 
with an average Round Trip Time (RTT) of 325 ms. The 
target was a PC with Intel Celeron 2.6 GHz CPU running 
FreeBSD 4.10, and measurements were taken from the 
TCP timestamp clock, which has a frequency of 1 kHz.
No additional CPU load was generated on the target dur- 
ing the measurement. 

The constant clock skew $s_c$ has already been removed.
The grey dots are the offsets between the two clocks,
the green line on top is the piece-wise estimation
of the variable skew and the blue triangles are the
negated values of the derivative of the variable clock
skew (the negated clock skew change).

The noise apparent in the figure has two components: 
the network jitter (on the path from the target to the at- 
tacker) and the quantisation error. Note that the network 
jitter also contains noise inherent in measuring when 
packets are received by the attacker, and noise caused by 
variable delay at the target between generating the times- 
tamps and sending packets. In Figure 1, we can clearly 
see the 1 ms quantisation noise band below the estimated
slope, caused by the target’s 1 kHz clock. Offsets below 
this band were also affected by network jitter. 

The samples close to the slope on top are the samples
obtained immediately after a clock tick (with negligible
network jitter). The samples at the bottom of the
quantisation noise band are samples obtained immediately
before the clock tick. With the linear programming
algorithm, only the samples close to the slope on top
contribute to the accuracy of the measurement. Assuming
an uncongested network, the network jitter is skewed
towards zero and small even on long-distance links (see
Figure 10). The quantisation noise is inversely
proportional to the frequency of the target’s clock. Depending on
the source of timestamps used and the target’s operating
system, the clock frequency is typically between 1 Hz
and 1 kHz (resulting in a noise band between 1 s and
1 ms). If the target does not expose a high-frequency
clock, the quantisation noise can be significantly larger
than the noise caused by network jitter.

Figure 1: Estimating the variable clock skew


To increase the accuracy of the measurement in the
presence of high quantisation noise, w must be set to
larger values, as the probability of getting samples close
to the slope on top increases with the number of samples.
However, a large w only allows very coarse measurements.
Oversampling provides more fine-grained results
while keeping w large to minimise the error. Without
oversampling the time windows do not overlap and the
start times of windows are S = {0, w, 2w, ..., nw}. With
oversampling, the windows overlap and hence the windows
start at times S = {0, w/o, 2w/o, ..., nw/o}, where o
is the oversample factor.
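The two sets of window start times can be written as a one-line helper. This is only an illustration of the formula for S; the function name and argument order are assumptions.

```python
def window_starts(w, n, o=1):
    """Start times {0, w/o, 2w/o, ..., nw/o} of the estimation windows.

    w is the window length, n the index of the last window and o the
    oversample factor (o=1 means no oversampling, so windows do not overlap).
    """
    if w <= 0 or o < 1 or n < 0:
        raise ValueError("expected w > 0, o >= 1 and n >= 0")
    return [k * w / o for k in range(n + 1)]


if __name__ == "__main__":
    print(window_starts(10, 3))       # non-overlapping windows
    print(window_starts(10, 3, o=2))  # each window overlaps the next by half
```

With o = 2 and w = 10, consecutive windows start 5 time units apart while each still covers 10, so every sample contributes to two estimates.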


However, even with large values of w, over-sampling
has a number of drawbacks. The first estimate is obtained
after w/2 (regardless of o), meaning that for large w it is
impossible to get estimates close to the start and end of
measurements. Furthermore, large w make it impossible
to accurately measure steep clock-skew changes. For
example, such changes happen when a CPU load
inducement is started and the temperature increases quickly [3].
Another disadvantage of oversampling is the increased
computational complexity of O(o·n·w) compared to
O(n·w) without over-sampling to obtain the same number
of clock skew estimates per time interval.


3.1 Timestamp Sources


Previous research used different timestamp sources
for clock-skew estimation: ICMP timestamp responses, 
TCP timestamp header extensions or TCP sequence 
numbers [3,4]. 

TCP sequence numbers on Linux are the sum of a 
cryptographic result and a 1 MHz clock. They provide
good clock-skew estimates over short periods because of 
the high frequency, but re-keying of the cryptographic 
function every five minutes makes longer measurements 
non-trivial [3]. 

ICMP timestamps have a fixed frequency of 1 kHz. 
Their disadvantage is that they are affected by clock ad- 
justments done by the Network Time Protocol (NTP) [9], 
which makes estimation of variable clock skew more dif- 
ficult. Furthermore, ICMP messages are now blocked by 
many firewalls. 

TCP timestamps have a frequency between 1 Hz and
1 kHz, depending on the operating system. Their
advantage is that they are generated before NTP
adjustments are made [4]. TCP timestamps are currently the
best option for clock-skew measurement because they 
are widely available and unaffected by NTP (at least for 
Linux, FreeBSD and Windows [4]). However, even TCP 
timestamps are not available in all situations. They may 
not be enabled on certain operating systems and they can- 
not be used if there is no end-to-end TCP connection to 
the target. For example, they cannot be used through the 
Tor anonymisation network. 

HTTP timestamps have a frequency of 1 Hz and are 
available from every web server. However, these have 
not been previously used for clock-skew measurement 
due to the low frequency. We describe how to exploit 
them in the following section. 


4 New Attacks 


A major disadvantage of the attack in [3] is that the at- 
tacker needs to exchange large amounts of traffic with 
the hidden service across the Tor network in order to 
accurately measure clock skew changes. It may not be 
possible to actually send sufficient traffic because Tor 
does not provide enough bandwidth, or because the ser- 
vice operator actively limits the request rate to avoid 
overload, prevent Denial of Service (DoS) attacks etc. 
Furthermore, the attack also relies on an exposed
high-frequency timestamp source (experiments used the 1 kHz
TCP timestamp) on the target for adequate clock-skew
estimation.

The synchronised sampling technique proposed in this
paper improves the existing attack by reducing the
duration and amount of network traffic required. Measuring
clock skew requires only a small amount of traffic
compared to the amount of traffic needed for load
inducement. For example, the exchange of one
request/response every 1.5 s is sufficient for clock-skew
estimation.

Our improvements also make the existing attack ap- 
plicable in situations where high-resolution timestamps 
are not available. For example ICMP or TCP times- 
tamps (see Section 3.1) are not available across Tor, since 
it only supports TCP and streams are re-assembled on 
the client, removing any headers. Because our proposed 
technique allows accurate clock-skew estimation from 
low-resolution timestamps, HTTP timestamps obtained 
from a hidden web server across the Tor network could 
be used. The fact that low-resolution timestamps are us- 
able opens the door to new variants of the attack. 

In the first new attack variant, the attacker measures 
the variable clock skew of the hidden service via Tor, and 
of all the candidate hosts via accessing the IP addresses 
directly. Then the attacker compares the variable clock- 
skew pattern of the hidden service with the patterns of 
all the candidates. The variable clock skew patterns of 
different hosts differ sufficiently over time, and the du- 
ration of the attack could be increased arbitrarily. While 
this attack has the benefit of not requiring large amounts 
of traffic to be exchanged, it could still take a long time. 
The attacker must ensure that both timestamp sources are
derived from the same physical source.

A quicker version of this attack could only compare 
the fixed clock-skew of the target measured via Tor with 
the fixed clock-skew measured directly for all candidates. 
Kohno et al. showed that clock skew of a particular host 
changes very little over time, but the difference between 
different hosts is significant [4]. 
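The matching step of this quicker variant amounts to a nearest-value lookup. The sketch below assumes, purely for illustration, that fixed skews are expressed in parts per million and that the smallest absolute difference identifies the host; the function name, host labels and skew values are hypothetical.

```python
def closest_candidate(target_skew_ppm, candidate_skews_ppm):
    """Return the candidate whose directly measured fixed skew is closest
    to the fixed skew measured for the hidden service via Tor.

    Assumption: skews are given in ppm and compared by absolute difference.
    """
    if not candidate_skews_ppm:
        raise ValueError("no candidates given")
    return min(candidate_skews_ppm,
               key=lambda host: abs(candidate_skews_ppm[host] - target_skew_ppm))


if __name__ == "__main__":
    # Hypothetical measurements, not taken from the paper.
    candidates = {"host-a": -12.4, "host-b": 37.9, "host-c": 3.1}
    print(closest_candidate(37.5, candidates))
```

In practice the attacker would also need the gap between the best and second-best match to exceed the measurement error before treating the result as an identification.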

Another new attack variant is based on the idea of us- 
ing clock-skew estimates for geo-location [3]. The at- 
tacker identifies the location of the candidates based on 
their IP addresses and a geo-location database. For ex- 
ample, GeoLite is a freely available database that maps 
IP addresses to locations with a claimed accuracy of over 
98% [10]. The attacker measures the variable clock skew 
of the hidden service via Tor. The attacker then estimates 
the location based on the variable clock-skew pattern us- 
ing the technique described in [3]. 

This attack works even in cases where the attacker 
cannot access the candidate hosts directly. On the other 
hand this attack does not allow an unambiguous identi- 
fication of the hidden service if candidate locations are 
geographically close together. 


4.1 Attacking HTTP Timestamps 


The key common factor among the new attacks discussed
above is that clock skew must be estimated from
responses sent over the hidden service channel. Previous
work has not examined this option because typically
only a low frequency clock is available and quantisation
noise dominates the small effect of temperature on clock
skew. However, in this paper we will show how it is
still possible to exploit this clock. Here, the attacker acts
as an HTTP client sending minimal HTTP requests to the
target. Standard web servers include a 1 Hz timestamp
in the Date header of HTTP responses because it was
recommended for HTTP 1.0 [11] and is mandatory for
HTTP 1.1 (excluding 5xx server errors and some 1xx
responses) [12].

Figure 2: HTTP request/response and the timestamps
needed for clock skew estimation

Before the HTTP exchange a TCP connection needs 
to be established between client and server. Including 
TCP connection establishment, it may take at least two 
full round trip times from when the client wants to send 
a request until the response is received. To minimise the 
TCP overhead, the client should open a TCP connection 
beforehand. Ideally, it should only open the connection 
once and then keep it alive for the duration of the mea- 
surement. However, it is not possible for a client to force 
a server to keep a connection open. Therefore, when the 
client notices that the server has closed the connection, 
it should immediately re-open it. Then, the next HTTP 
request can be sent at the appropriate time determined by 
the synchronised sampling algorithm. 

The HTTP timestamp is usually generated after the 
server has received the client’s request. We verified that 
Apache 2.2.x generates the timestamp after the request 
has been successfully parsed [13]. The corresponding 
client timestamp is the time the packet containing the 
Date header is received, which is usually the first packet 
sent by the server after the TCP connection has been fully 
established (see Figure 2). 
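Turning one HTTP response into an offset sample can be sketched with the Python standard library. This is an illustration rather than the measurement tool used in the paper; the function name is hypothetical, and the local receive time is assumed to be supplied by the caller as Unix seconds.

```python
from email.utils import parsedate_to_datetime


def http_offset(date_header, local_receive_time):
    """Offset in seconds between the server's Date header and the local
    time at which the packet carrying that header was received.

    HTTP dates are always in GMT and have 1 s resolution, so the result
    carries up to one second of quantisation error.
    """
    server_time = parsedate_to_datetime(date_header).timestamp()
    return server_time - local_receive_time


if __name__ == "__main__":
    # Example header value and a hypothetical local receive time.
    header = "Tue, 15 Jan 2008 12:00:00 GMT"
    print(http_offset(header, 1200398400.25))
```

Each such offset, paired with its receive time, is one sample for the line fit of Section 3.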


5 Implementation 


Previous approaches to remote clock-skew estimation 
have sampled timestamps at random times. However, 
with the 1 Hz HTTP timestamp, the consequent high
quantisation noise would prevent the attack from accu- 
rately measuring the target’s clock skew within a feasible 
period. Instead, we must probe the target’s clock imme- 
diately after a clock tick occurred, because here the quan- 
tisation error is the smallest. To achieve this, the attacker 
has to synchronise its probing phase and frequency such 
that probes arrive shortly after the clock tick. We as- 
sume the attacker selects a nominal sample frequency, 
based on the desired accuracy and intrusiveness of the 
measurement. 

The attacker cannot measure the exact time difference 
between the arrival of probe packets and the clock ticks. 
To maintain synchronisation, the attacker has to alternate 
between sampling before and after the clock tick. Sam- 
ples before the clock tick can be corrected by adding one 
tick, as their true value is actually closer to the next clock 
tick. However, the linear programming algorithm still 
cannot use these samples because for them the quantisa- 
tion error and jitter are in opposite directions and cannot 
be separated. 

Figure 3 illustrates the benefit of synchronised sam- 
pling over random sampling. The solid step line is the 
target’s clock value over time and the dashed line shows 
the true time. Random samples are distributed uniformly 
between clock ticks whereas synchronised samples are 
taken close to the clock ticks. Note that in the figure, the 
time for samples before the tick has been corrected as 
described above. The quantisation errors are the differ- 
ences between samples’ y-values and the true time. The 
absolute quantisation errors are shown as bars at the bot- 
tom. Synchronised sampling leads to smaller errors in 
comparison with random sampling. 
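
The effect shown in Figure 3 can be reproduced numerically. The following is a minimal sketch, assuming an idealised 1 Hz target clock with no jitter; the helper names, the sample count and the 5 ms arrival offset after each tick are illustrative assumptions and not measured values.

```python
import math
import random

def quantisation_error(t, freq_hz):
    """Difference between true time t and the last tick of a freq_hz clock."""
    tick = 1.0 / freq_hz
    return t - math.floor(t / tick) * tick

def mean_error(sample_times, freq_hz):
    """Average quantisation error over a list of sample times."""
    errors = [quantisation_error(t, freq_hz) for t in sample_times]
    return sum(errors) / len(errors)

random.seed(1)
freq_hz = 1.0  # assumed 1 Hz clock, as for HTTP timestamps
n = 1000

# Random sampling: arrival times uniformly distributed between ticks.
random_times = [k + random.random() for k in range(n)]
# Synchronised sampling (idealised): every probe arrives 5 ms after a tick.
sync_times = [k + 0.005 for k in range(n)]

random_err = mean_error(random_times, freq_hz)
sync_err = mean_error(sync_times, freq_hz)
print(f"random: {random_err:.3f} s, synchronised: {sync_err:.3f} s")
```

In this idealised setting the random-sampling error averages about half a tick, whereas the synchronised error stays at the chosen arrival offset.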

Our algorithm is similar to existing Phase-Locked Loop
(PLL) techniques used for aligning the frequency and
phase of a signal with a reference [14]. However,
whereas PLL techniques measure the phase difference
of two signals, we can only estimate the phase difference
by detecting whether a sample was taken before or after
the clock tick.


5.1 Algorithm 


Initially, the attacker starts probing with the nominal 
sample frequency and measures how many clock ticks 
occur in one sample interval (target_ticks_per_interval). 
The measurement is repeated to obtain the correct num- 
ber of ticks. 

The attacker cannot measure the exact time difference
between the arrival of a probe packet and the target’s
clock tick. However, the attacker can measure the position
of the probe arrival relative to the target’s clock
tick based on the number of clock ticks that occurred
between the current and the last timestamp of the target
(ticks_diff). If the number of clock ticks is less
than target_ticks_per_interval, the sample was taken before
the tick; if it is greater, the sample was taken after
the tick. If ticks_diff equals target_ticks_per_interval
the position is left unchanged. At
the start of the measurement the position is not known
and the attacker needs to continuously increase or decrease
the probe interval until a change occurs (initial
phase lock).

Figure 3: Advantage of synchronised sampling over random sampling


The probe interval (the reciprocal of the probe frequency)
is controlled using the following mechanism
(see Algorithm 1). Each time a position change occurs
and the previous position was known, the probe interval
is adjusted based on the position errors using a
Proportional Integral Derivative (PID) controller [15]. PID
controllers base the adjustment not only on the last error
value, but also on the magnitude and the duration of
the error (integral part) as well as the slope of the error
over time (derivative part). Kp, Ki and Kd are pre-defined
constants of the PID controller.

Alternatively, the linear programming algorithm [8] 
could be used to compute the relative clock skew be- 
tween attacker and target based on a sliding window of 
timestamps. The probe interval is then adjusted based on 
the estimated relative clock skew. This technique works 
well if the estimates are fairly accurate, which is the case 
for high-frequency clocks and low network jitter. 
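
To illustrate the kind of estimate such a sliding window yields, the sketch below fits the line that lies on or below all (time, offset) points while minimising the total vertical distance to them. For two unknowns this linear program can be solved exactly on the lower convex hull, which is what the code does. It is a minimal sketch and not the implementation of [8]; the function name `skew_lower_bound`, the sign convention (jitter only ever increases the offset) and the sample values are assumptions made for illustration.

```python
def skew_lower_bound(points):
    """Return (slope, intercept) of the line on or below all points that
    minimises the summed vertical distance to them."""
    pts = sorted(set(points))
    hull = []  # lower convex hull, built with Andrew's monotone chain
    for p in pts:
        while len(hull) >= 2:
            (ox, oy), (ax, ay) = hull[-2], hull[-1]
            cross = (ax - ox) * (p[1] - oy) - (ay - oy) * (p[0] - ox)
            if cross <= 0:  # last hull point lies on or above the new chord
                hull.pop()
            else:
                break
        hull.append(p)
    # The optimal line supports the hull edge that spans the mean time.
    mean_x = sum(x for x, _ in points) / len(points)
    for (x0, y0), (x1, y1) in zip(hull, hull[1:]):
        if x0 <= mean_x <= x1 and x1 > x0:
            slope = (y1 - y0) / (x1 - x0)
            return slope, y0 - slope * x0
    raise ValueError("points must span more than one time value")

# Hypothetical window: offsets grow with slope 2 plus non-negative jitter.
samples = [(0, 1.0), (1, 3.5), (2, 5.0), (3, 7.2), (4, 9.0)]
slope, intercept = skew_lower_bound(samples)
```

If the sign convention is reversed, so that jitter only decreases the offset, the same routine can be applied to the negated offsets.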


Algorithm 1 Probe interval control 





function probe_interval_adjustment(pos, last_pos)
    if pos != last_pos and pos != UNKNOWN and
       last_pos != UNKNOWN then
        return Kp·(last_adj_before + last_adj_behind) +
               Ki·(integ_adj_before + integ_adj_behind) +
               Kd·(deriv_adj_before + deriv_adj_behind)
    else
        return 0
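
Algorithm 1 can be sketched in Python as follows. The listing does not spell out how the integral and derivative terms are accumulated, so this sketch assumes the signed position error is the sum of the last before and behind adjustments, the integral is a running sum of that error and the derivative is its change since the previous position change. The class name and the example adjustment values are hypothetical; the default gains are those reported for the HTTP measurements in Section 6.

```python
from dataclasses import dataclass

UNKNOWN, BEFORE, BEHIND = "unknown", "before", "behind"

@dataclass
class ProbeIntervalController:
    kp: float = 0.09
    ki: float = 0.0026
    kd: float = 0.02
    integral: float = 0.0    # assumed: running sum of position errors
    last_error: float = 0.0  # assumed: error at the previous position change

    def adjustment(self, pos, last_pos, last_adj_before, last_adj_behind):
        """Return the change to apply to the probe interval (seconds)."""
        if pos == last_pos or UNKNOWN in (pos, last_pos):
            return 0.0
        # Before adjustments are positive and behind adjustments negative,
        # so their sum is taken here as the signed position error.
        error = last_adj_before + last_adj_behind
        self.integral += error
        derivative = error - self.last_error
        self.last_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

ctrl = ProbeIntervalController()
delta = ctrl.adjustment(BEHIND, BEFORE,
                        last_adj_before=0.004, last_adj_behind=-0.002)
```

Keeping the controller state in one object makes it straightforward to reset the integral when synchronisation is lost, although whether that is desirable depends on the deployment.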

In order to maintain the synchronisation, the attacker
has to enforce regular position changes. This is done 
by modifying the time the next probe is sent. If the 
current position is before the clock tick the send time 
of the next probe is increased based on the last adjust- 
ment last_adj_before. If the current position is behind 
the clock tick the next probe send time is decreased based 
on the last adjustment last_adj_behind. The adjustments 
are modified based on how well the attacker is synchro- 
nised to the target. 


If a change of position occurs between two samples, 
the difference between the arrival of the probe packet 
and the target’s clock tick is smaller than the last ad- 
justment and therefore the next adjustment is decreased. 
If no position change occurs the error is assumed to be 
larger than the last adjustment and the next adjustment 
is increased. The initial probe send-time adjustment is a 
pre-defined constant. Algorithm 2 shows the probe send
time adjustment algorithm. α and β are pre-defined constants
that determine how quickly the algorithm reacts
(0 < α < 1 and β > 1).


Algorithm 2 Next probe send time adjustment 





function next_probe_time_adjustment(pos, last_pos)
    if pos == BEFORE then
        last_adj = last_adj_before
    else
        last_adj = last_adj_behind

    if pos != last_pos then
        return α·last_adj
    else
        return β·last_adj





The probe frequency and send time adjustments are 
limited to a range between pre-defined minimum and 
maximum values to avoid very small or very large 
changes. 
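
The send-time adjustment of Algorithm 2, combined with the range limit described above, might look as follows. This is a sketch under stated assumptions: the limits `ADJ_MIN` and `ADJ_MAX` are hypothetical values in seconds, and the sign handling assumes that before adjustments are positive and behind adjustments are negative. The values of α and β are those used in the evaluation.

```python
BEFORE, BEHIND = "before", "behind"

ALPHA, BETA = 0.5, 1.5        # 0 < ALPHA < 1 and BETA > 1
ADJ_MIN, ADJ_MAX = 1e-5, 0.1  # hypothetical limits in seconds

def next_adjustment(pos, last_pos, last_adj_before, last_adj_behind):
    """Return the signed send-time adjustment for the next probe."""
    last_adj = last_adj_before if pos == BEFORE else last_adj_behind
    # Shrink the step after a position change, grow it otherwise.
    factor = ALPHA if pos != last_pos else BETA
    magnitude = min(max(abs(last_adj) * factor, ADJ_MIN), ADJ_MAX)
    # Probes before the tick are sent later (+), probes behind it earlier (-).
    return magnitude if pos == BEFORE else -magnitude

shrunk = next_adjustment(BEFORE, BEHIND, 0.004, -0.002)
grown = next_adjustment(BEHIND, BEHIND, 0.004, -0.002)
capped = next_adjustment(BEFORE, BEFORE, 0.09, -0.01)
```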


Loss of responses is detected using sequence num- 
bers. A sequence number is embedded into each probe 
packet such that the target will return the sequence num- 
ber in the corresponding response. The actual field de- 
pends on the protocol used for probing. For example, for 
ICMP the sequence number is the ICMP Identification 
field whereas for TCP the sequence number is the TCP 
sequence number field. 


For HTTP it is not possible to embed a sequence num- 
ber directly into the protocol. Instead, sequence numbers 
are realised by making requests cycling through a set of 
URLs. A sequence number is associated with each URL 
and HTTP responses are mapped to HTTP requests using 
the content length assumed to be known for each object. 
This technique assumes there are multiple objects with 
different content lengths accessible on the web server. 
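
The URL-cycling scheme can be sketched as a lookup from content length to sequence number. The paths and content lengths below are hypothetical; in practice they would have to be determined for the target web server beforehand.

```python
from itertools import cycle

# Hypothetical objects on the target and their known content lengths (bytes).
KNOWN_OBJECTS = {"/index.html": 5120, "/logo.png": 18342, "/style.css": 911}

def build_sequence_map(objects):
    """Map each content length to a sequence number; lengths must be unique."""
    lengths = list(objects.values())
    if len(set(lengths)) != len(lengths):
        raise ValueError("content lengths must differ between objects")
    urls = list(objects)
    seq_by_length = {objects[url]: seq for seq, url in enumerate(urls)}
    return urls, seq_by_length

urls, seq_by_length = build_sequence_map(KNOWN_OBJECTS)
request_order = cycle(urls)  # probes cycle through the URL set

def sequence_of_response(content_length):
    """Return the sequence number of a response, or None if it is unknown."""
    return seq_by_length.get(content_length)
```

The number of distinct objects bounds how many consecutive losses can be told apart, since the sequence numbers wrap around once the URL set is exhausted.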

If packet loss is detected, the algorithm adjusts
ticks_diff by subtracting the number of lost packets multiplied
by target_ticks_per_interval. Reordered packets
are considered lost.
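
As a worked example of this correction, the following sketch derives the number of lost responses from consecutive sequence numbers. The function name and the sample numbers are assumptions, and sequence-number wrap-around is ignored for brevity.

```python
def corrected_ticks_diff(ticks_diff, seq, last_seq, target_ticks_per_interval):
    """Remove the ticks that elapsed during lost probe intervals."""
    lost = seq - last_seq - 1  # responses missing between the two samples
    if lost < 0:
        raise ValueError("reordered response: treat it as lost and skip it")
    return ticks_diff - lost * target_ticks_per_interval

# Two responses lost between sequence numbers 7 and 10, with 2 ticks
# per interval: a raw difference of 6 ticks is corrected to 2 ticks.
example = corrected_ticks_diff(6, seq=10, last_seq=7, target_ticks_per_interval=2)
```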


Algorithm 3 shows the synchronisation procedure. 
Our algorithm works with different timestamps and dif- 
ferent clock frequencies. It has been tested with ICMP, 
TCP and HTTP timestamps and TCP clock frequencies 
of 100 Hz, 250 Hz and 1 kHz. 


Algorithm 3 Synchronised sampling 





foreach response_packet do
    ticks_diff = target_timestamp - last_target_timestamp

    if target_ticks_per_interval == -1 then
        pos = UNKNOWN
        target_ticks_per_interval = ticks_diff
    else if ticks_diff > target_ticks_per_interval then
        pos = BEHIND
    else if ticks_diff < target_ticks_per_interval then
        pos = BEFORE
    else
        pos = last_pos

    probe_interval = probe_interval +
        probe_interval_adjustment(pos, last_pos)
    probe_time = last_probe_time + probe_interval +
        next_probe_time_adjustment(pos, last_pos)

    last_pos = pos
    last_target_timestamp = target_timestamp
    last_probe_time = probe_time
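
The position logic of Algorithm 3 can be exercised on a short trace. The sketch below assumes a 1 Hz target clock and a nominal probe interval of two ticks; the arrival times are made up for illustration, and target_ticks_per_interval is taken as already known rather than measured.

```python
import math

UNKNOWN, BEFORE, BEHIND = "unknown", "before", "behind"

def classify(ticks_diff, target_ticks_per_interval, last_pos):
    """Classify a probe relative to the target's clock tick."""
    if ticks_diff > target_ticks_per_interval:
        return BEHIND
    if ticks_diff < target_ticks_per_interval:
        return BEFORE
    return last_pos  # no information: keep the previous position

# Hypothetical probe arrival times (seconds) at the target.
arrival_times = [0.50, 2.50, 4.98, 7.01, 8.99, 11.02]
timestamps = [math.floor(t) for t in arrival_times]  # 1 Hz timestamps

positions, pos = [], UNKNOWN
for prev, cur in zip(timestamps, timestamps[1:]):
    pos = classify(cur - prev, 2, pos)
    positions.append(pos)
```

The position stays unknown until the tick difference first deviates from the nominal value, which corresponds to the initial phase lock described above.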





5.2 Errors 


Any constant delay does not affect the synchronisation 
process. However, changes in delay on the path from 
the attacker to the target will affect the arrival time of 
the probe packets. This could be caused by jitter in the 
sending process (jitter inside the attacker), network jitter 
(queuing delays in routers) or jitter in the target’s packet 
receiving process. 


Often we can assume the network is uncongested and 
therefore network jitter is skewed towards zero. This 
is usually the case in a LAN (see Section 6). Even on 
the Internet many links are not heavily utilised and path 
changes (caused by routing changes) are usually infre- 
quent. Load-balancing is usually performed on a per- 
flow basis to eliminate any negative impacts on TCP and 
UDP performance. 


However, when measuring clock skew over a Tor cir- 
cuit we expect much higher network jitter. A Tor cir- 
cuit is composed of a number of network connections 
between different nodes. The overall jitter does not only 
include the jitter of each connection but also the jitter in- 
troduced by the Tor nodes themselves. 



The timing of sending probes is not very exact if the 
sender is a userspace application. Even if the userspace 
send() system call is called at the appropriate time, there 
will be a delay before the packet is actually sent onto 
the physical medium. The variable part of this delay can 
cause probe packets to arrive too late or too early. This 
error could be reduced by running the software entirely 
in the kernel (e.g. as kernel module), using a real-time 
operating system or by using special network cards sup- 
porting very precise sending of packets. Any variable 
delay in the packet receiving process of the target has 
the same effect and is unfortunately out of control of the 
attacker. The only way an attacker could reduce such er- 
rors would be to adjust the sending of the probe packets 
based on a prediction of the jitter inside the target, which 
appears to be a challenging task. 

Another error is introduced when the relative clock
skew between attacker and target changes and the algorithm
needs to adjust the probe frequency. The attacker
is able to control its time keeping and avoid any sudden 
clock changes. But if the target is running NTP and the 
timestamps are affected by NTP adjustments, changes in 
relative clock skew are possible. 


6 Evaluation 


In the first part of this section we compare the accuracy of 
synchronised and random sampling in a LAN testbed us- 
ing TCP timestamps with typical target clock frequencies 
of 100 Hz, 250 Hz and 1000 Hz, as well as 1 Hz HTTP 
timestamps. Since in the LAN network jitter is negligi- 
ble the results show the maximum improvement of using 
synchronised sampling and demonstrate that our imple- 
mentation is working correctly. 

In the second part we compare the accuracy of syn- 
chronised and random sampling based on TCP times- 
tamps across a 22-hop Internet path. The result shows 
that even on a long path, synchronised sampling signif- 
icantly increases clock-skew estimation accuracy, which 
improves on the attack proposed by Murdoch [3]. 

In the third part we compare the accuracy of synchro- 
nised and random sampling for probing a web server run- 
ning as a Tor hidden service. We show that synchronised 
sampling improves clock skew estimation significantly, 
even over a path with high network jitter. We also show 
that a hidden web server can be identified among a candidate
set by comparing the variable clock skew over
time using synchronised sampling. Furthermore, using 
synchronised sampling shows daily temperature patterns 
that could not be identified using random sampling. 

Finally, we investigate how long our technique needs 
for the initial synchronisation (the time until the attacker 
has locked on to the phase and frequency of the target’s 
clock ticks). We compare the times for HTTP probing
in a LAN and probing a hidden web server over a Tor
network. 

To evaluate the accuracy of synchronised and random 
sampling we need to know the true values of the variable 
clock skew. Since it is impossible to directly measure 
this, we use the following approach. In our tests the tar- 
get also runs a UDP server and the attacker runs a UDP 
client. The UDP client sends requests to the server at reg- 
ular time intervals. Upon receiving a request, the UDP 
server returns a packet with a timestamp set to the send 
time of the response. The UDP client records the time it 
receives the response. 

We compute the offset between the two timestamp- 
series and estimate the variable clock skew as usual. 
Since the UDP-based timestamp has a precision of 1 µs,
the quantisation error is negligible. Although these UDP 
estimates are not the true values of the variable clock 
skew we use them as baseline for synchronised and ran- 
dom sampling, which have much higher quantisation er- 
rors. In the following we refer to this as UDP probing or 
UDP measurement. 

A drawback of our current implementation is that the 
UDP server is a userspace program. The server’s re- 
sponse timestamp is taken in userspace before the re- 
sponse packet is sent via the sendto() system call. To 
reduce these timing errors one could implement a kernel- 
based version of the UDP server. 

We compare the variable skew estimates for synchronised
and random sampling with the reference values,
from the UDP measurement, using the root mean square
error (RMSE) of the data values x against the reference
values x̂:

\[ \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( \hat{x}_i - x_i \right)^2 } \qquad (2) \]
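
Equation (2) translates directly into code. The following is a minimal sketch; the function name and the example values are illustrative only.

```python
import math

def rmse(values, reference):
    """Root mean square error of values against equally long reference values."""
    if not values or len(values) != len(reference):
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(r - x) ** 2 for x, r in zip(values, reference)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical skew estimates (ppm) against reference values (ppm).
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```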


We also compute histograms of the noise band for synchronised
and random sampling. The noise is defined
as the difference between the variable clock offset and the
variable skew estimated from the UDP timestamps. For random
sampling the quantisation noise band is always uniform 
with width 1/f, where f is the clock frequency. For syn- 
chronised sampling the quantisation noise depends on 
how well the synchronised sampling algorithm is able to 
track the target’s clock tick. For synchronised sampling 
the noise is given by the samples taken after the clock 
tick because only these samples are used to estimate the 
clock skew. 

In all experiments we set α = 0.5 and β = 1.5. For
TCP timestamps the linear programming algorithm was
used to adjust the probe interval with a sliding window of
size 120 (LAN) and 300 (Internet). For HTTP timestamp
measurements the probe interval was adjusted using the
PID controller with Kp = 0.09, Ki = 0.0026 and Kd =
0.02.


6.1 Synchronised vs. Random Sampling in 
LAN Environments 


The attacker was a PC with Intel Xeon 3.6 GHz Quad-Core
CPU running Linux 2.6. The target was a PC with
Intel Xeon 3.6 GHz Quad-Core CPU running Linux 2.6
with a TCP timestamp frequency of 1 kHz. Attacker and
target were connected to the same Ethernet switch. The
attacker simultaneously performed synchronised, random
and UDP probing. Synchronised and random probing
had an average sampling period of 1.5 s, the same rate
as in [3]. The UDP probing was performed with a shorter
sampling period of 1 s in order to achieve a higher accuracy
for the reference measurement. A second UDP measurement
with an average sampling period of 1 s was run in order
to investigate the error between UDP measurements. The
duration of the test was approximately 24 hours.

As the test was run inside a LAN the average RTT was
only 130 µs and the RTT / 2 jitter was small with a maximum
of 60 µs and a median of 30 µs. Figure 4 shows
histograms of the noise bands of synchronised and random
sampling with respect to the reference given by the
UDP measurement. For synchronised sampling most of
the offsets are within 100 µs whereas for random sampling
we see the expected 1 ms noise band.

In Figure 5 we compare the RMSE of synchronised 
sampling, random sampling and the second UDP mea- 
surement for different window sizes against the UDP ref- 
erence with maximum window size (1800 s). We also 
compare the UDP reference against itself at smaller win- 
dow sizes. The oversampling factor was chosen such that 
the time between two clock-skew estimates is the same 
regardless of the window size (30s). This has the advan- 
tage of providing approximately the same number of es- 
timates for all window sizes. (For smaller window sizes 
there are still more samples, because there are samples 
closer to the start and end of the measurement period.) 

Figure 5 shows that synchronised sampling performs 
significantly better than random sampling. There is a dif- 
ference between the second UDP measurement and the 
UDP reference, but it is smaller than the difference be- 
tween synchronised sampling and the UDP reference for 
all window sizes. Hence we conclude the error of UDP 
measurements is sufficiently small for using it as base- 
line. In the later experiments we performed only one 
UDP measurement. 

The target clock frequency was 1 kHz, which is the
maximum TCP timestamp frequency of current operating
systems. However, it is likely that in reality many
hosts actually have lower TCP clock frequencies. For example,
100 Hz is the clock frequency used by older Linux
and FreeBSD kernels and 250 Hz is the clock frequency
of modern Linux 2.6 kernels.

Figure 4: Noise distributions in LAN: synchronised sampling (left) vs. random sampling (right)

Figure 5: RMSE of synchronised, random sampling and
UDP reference in LAN with a target clock frequency of
1 kHz (log y-axis)

To evaluate the RMSE for lower clock frequencies we
used the same setup. This time we ran three synchronised
and three random probing processes simultaneously for
24 hours, rounding the target timestamps so that we effectively
measured 100 Hz, 250 Hz and 1 kHz clocks.
Figure 6 shows the RMSE for synchronised and random
sampling for the different target clock frequencies. The
UDP measurement has been omitted for better readability.
The graph shows that the accuracy of synchronised
sampling does not depend on the clock frequency, whereas
the RMSE for random sampling increases significantly
for lower clock frequencies.

In another LAN experiment we ran a web server
(Apache 2.2.4) on the target and the attacker used HTTP
probing. The average sampling interval was 2 s, because
this is the minimum probe frequency for 1 Hz HTTP
timestamps. The web server was completely idle (except
for the requests generated by the attacker). The duration
of the experiment was approximately 24 hours.

Figure 6: RMSE of synchronised and random sampling
for different clock frequencies of 100 Hz, 250 Hz and
1 kHz against the same target. Different clock frequencies
were obtained by rounding the target timestamps immediately
after reception (log y-axis)


Although the experiment was carried out between the
same two hosts as before, the RTT / 2 jitter was higher
with a maximum of 120 µs and a median of 60 µs. The
web server running in userspace introduced the additional
jitter. Figure 7 shows the noise for synchronised
sampling and random sampling. For synchronised sampling
the noise band is only slightly larger than in Figure
4. Because of the higher jitter, the synchronisation is less
accurate. For random sampling the noise band is 1 s because
of the 1 Hz HTTP clock frequency.

Figure 7: Noise distributions for HTTP probing in LAN:

Figure 8: RMSE of synchronised sampling, random sampling
and the UDP measurement for HTTP probing in
LAN


Figure 8 shows the RMSE of synchronised sampling, 
random sampling and the UDP measurement against the 
reference at maximum window size. The RMSEs of syn- 
chronised sampling and UDP reference are very similar 
to the results in Figure 5. Because of the large noise 
band, the RMSE for random sampling is more than two 
orders of magnitude above the RMSE for synchronised 
sampling. This demonstrates that our new algorithm is 
able to effectively measure clock skew changes for low 
frequency clocks, an infeasible task for random sampling.


6.2 Synchronised vs. Random Sampling 
Across Internet Paths 

The attacker was the same machine as in Section 6.1
located in Cambridge, UK. The target was 22 hops
away located in Canberra, Australia. The target was a
FreeBSD 4.10 PC with a kernel tick rate set to 1000 and
therefore the TCP timestamp frequency was 1 kHz. The
average RTT between measurer and target was 325 ms. 
The duration of the measurement was approximately 21 
hours. We performed synchronised, random and UDP 
probing. 

Despite the high RTT, the jitter is small and skewed 
towards zero as shown in Figure 10. Figure 9 shows 
histograms of the noise bands of synchronised and ran- 
dom sampling in relation to the reference given by the 
UDP measurement. For synchronised sampling most of 
the offsets are within 250 µs of the reference whereas for
random sampling there is the expected 1 ms noise band. 

Figure 11 shows the RMSE of synchronised sampling, 
random sampling and the UDP reference against the 
UDP reference at maximum window size using the same 
parameters as in Section 6.1. The gain of synchronised 
sampling is smaller compared to Section 6.1 because of 
the higher network jitter but still significant for smaller 
window sizes. 


6.3 Attacking Tor Hidden Services 


For our measurements we used a private Tor network. 
Our Tor nodes are distributed across the Internet running 
on Planetlab [16] nodes. The main reason for using a pri- 
vate Tor network instead of the public Tor network is the 
poor performance of hidden services in the public Tor 
network. Besides huge network jitter that prevents any
accurate clock-skew measurements, hidden services always
disappeared after a few hours, preventing longer measurements.
While currently it is difficult to carry out the
attack in the public Tor network, it should become easier 
in the future, as the Tor team is now working on improv- 
ing the performance of hidden services. 

We selected 18 widely geographically distributed
Planetlab nodes on which we ran Tor nodes (of which
3 were directory authorities). We selected nodes that
had low CPU utilisation at the time of selection. An
Intel Core2 2.4 GHz with 4 GB RAM running Linux
2.6.16 was used to run another Tor node and the hidden
web server. No load is induced on the server, so any
clock skew changes are based on the ambient temperature
changes.

Figure 9: Noise distributions for Internet path: synchronised sampling (left) vs. random sampling (right)

Figure 10: RTT jitter / 2 on path across the Internet

An Intel Celeron 2.4GHz with 1.2 GB of RAM run- 
ning Linux 2.6.16 was used to run a Tor client and our 
probe tool. We used tsocks [17] with the latest Tor re- 
lated patches to enable our tool to interact with the Tor 
client via the SOCKS protocol and to properly handle 
Tor hidden server pseudonyms. 

First we performed an experiment similar to the ones 
in Section 6.1 and Section 6.2. Synchronised and random 
sampling was performed across the Tor network, while 
UDP probing was performed directly between the client 
machine and the hidden server. The measurement dura- 
tion was approximately 18 hours. 

The average RTT between client and hidden server
across Tor was 885 ms. Figure 14 shows the RTT / 2
jitter, which is considerably higher than in the previous
measurements. Figure 12 shows histograms of the
noise bands of synchronised and random sampling. For
random sampling it shows the expected 1 s noise band.
For synchronised sampling the noise is greatly reduced.
Most of the offsets are < 100 ms away from the slope
given by the UDP reference.

Figure 11: RMSE of synchronised sampling and random
sampling for different window sizes measured across Internet
path.

Figure 13 shows the change of clock skew for synchronised
sampling as blue squares (□) and random sampling
as red circles (○) and the UDP reference as black line
(—) for a window size of 1800 s and 2 hours. The noise
is much smaller for synchronised sampling compared to
random sampling especially for small window sizes. For
a window size of 2 hours one can clearly see a daily temperature
change of the reference curve with the temperature
(and hence the clock skew) dropping during night
hours and suddenly increasing in the morning. The synchronised
sampling curve shows the same pattern with
added noise. An attacker could use such daily temperature
patterns to estimate the location of the target based
on geo-location. In contrast, with random sampling the pattern
is not clearly visible because of the much higher
noise.

Figure 12: Noise distributions Planetlab Tor testbed: synchronised sampling (left) vs. random sampling (right)

Figure 13: Estimated clock skew changes for hidden service in Planetlab Tor network for a window size of 1800 s
(left) and 2 hours (right)


In Figure 15 we compare the RMSE of synchro- 
nised sampling, random sampling and the UDP reference 
against the UDP reference at maximum window size. 
The RMSE of synchronised sampling is almost one order of
magnitude lower than the RMSE for random sampling even
for window sizes as large as two hours. 


In the second experiment we performed the actual at- 
tack. We treated all 19 Tor nodes as candidates and mea- 
sured their clock skew directly using TCP timestamps 
(synchronised sampling). At the same time we measured 
the clock skew of the hidden web service via Tor based 
on HTTP timestamps using synchronised and random 
sampling simultaneously. The experiment lasted about
ten hours. One of the nodes stopped responding in the
middle of the experiment. 


Figure 16 shows the RMSE of the HTTP clock skew 
estimates obtained from the hidden service via Tor us- 
ing random sampling or synchronised sampling and TCP 
clock skew estimates of all candidate nodes. We used a 
window size of three hours and set the oversample factor 
so one clock estimate is obtained every 30s. (For smaller 
windows random sampling was not able to consistently 
select one candidate as the best and would alternate be- 
tween a few including the correct one for the whole du- 
ration of the measurement.) 


The RMSE of the HTTP timestamp estimate and the 
correct candidate is shown as thick grey (red on colour 
display) line while RMSEs for all other candidates are 
shown as thin black lines. 


The RMSE between the synchronised sampling Tor
measurement and the direct measurement of the correct
candidate is very small, and with increasing duration becomes
significantly smaller than the RMSE of the Tor
measurement and all the other candidates except one. For
random sampling all RMSEs are fairly high indicating
that there is no good match of the variable clock skew
of the Tor hidden service with any of the candidates. In
the second half of the experiment the RMSE of the correct
candidate becomes smallest, but only by a very small
margin.

Figure 14: RTT jitter / 2 over Planetlab Tor testbed

Figure 15: RMSE of synchronised, random sampling and
UDP reference for hidden web service in Planetlab Tor
network (log y-axis)

Figure 16: RMSE of HTTP clock skew estimates obtained from hidden service via Tor using random sampling (left)
or synchronised sampling (right) and TCP clock skew estimates of all candidate nodes

Synchronised sampling is able to identify the correct 
candidate much faster than random sampling, needing 
only 139 minutes compared to 287 minutes. These times 
are from the start of the measurement until the RMSE of 
the correct candidate becomes smallest. They include the 
initial 1.5 hours it takes to get the first clock skew esti- 
mate (because of the three hour windows), which is not 
included in Figure 16.

While the variable clock skew of the TCP clock and
userspace clock (HTTP timestamps) are a good match,
the fixed skew of the two clocks differs on our Linux
2.6.16 box running the hidden server. This makes it im- 
possible to evaluate an identification of the hidden server 
based on the fixed skew. However, since we know the 
true fixed skew of the userspace clock, we can anal- 
yse how long it takes to get an estimate using synchro- 
nised and random sampling of the HTTP clock. We use 
the data from the previous measurement and assume the 
skew estimate is correct if within 0.5 parts per million 
of the true value. Again synchronised sampling outperforms
random sampling, needing only 23 minutes compared
to 102 minutes.

Figure 17: Initial synchronisation for HTTP probing in LAN (left) and probing a hidden web server over the Tor
network (right)


6.4 Initial synchronisation time 


We briefly analyse the initial synchronisation time of our 
technique. The initial synchronisation is the time it takes 
until the attacker has locked on to the phase and fre- 
quency of the target’s clock ticks. 

Figure 17 plots the values of adj_before, adj_behind 
and probe_interval (see Section 5 for the meaning of the 
variables) over the number of clock samples (taken from 
the target’s clock every 2 s). The y-axis range is limited to
between −10 ms and 10 ms and the x-axis is limited to the
first 1000 clock samples. Note that before adjustments 
are always positive, while behind adjustments are always 
negative. 

In the LAN experiment initial synchronisation is established
after only about 40 samples (roughly 1.5 minutes)
and further adjustments and probe interval changes
are small (less than 500 µs and 100 µs respectively).
When probing over the Tor network synchronisation is 
more difficult because of the much higher network jitter. 
Consequently initial synchronisation takes longer (about 
70 clock samples or roughly 2.5 minutes) and the algo- 
rithm is forced to make larger adjustments and probe in- 
terval changes (of up to several milliseconds). 


7 Conclusions and Future Work 


In this paper we have presented and evaluated an 
improved technique for remote clock-skew estimation 
based on the idea of synchronised sampling proposed by 
Murdoch [3]. The evaluation shows that our new algo- 
rithm provides more accurate clock skew estimates than 
the previous random sampling based approach. Especially
if the target clock frequency is low, accuracy improves
by up to two orders of magnitude. Since the accuracy
of our synchronised sampling technique is independent
of the target’s clock frequency, it is possible to
estimate variable clock-skew from low-resolution timestamps.

Our technique does not only improve the previously 
proposed clock-skew related attack on Tor [3], it also 
opens the door for new variants of the attack, which we 
have described in the paper. Our technique could also be 
used to improve the identification of hosts based on their 
clock skew as proposed in [4] if active measurement is 
possible. 

Currently our Tor test network is fairly small and only 
has one hidden server. While we showed that our new 
proposed attacks work in principle, we did not provide a 
comprehensive evaluation. In future work we plan to ex- 
tend our test network and add more hidden servers. This 
will allow us to perform a more detailed evaluation in- 
cluding analysing the sensitivity and specificity of our 
attack based on the different parameters. 

The synchronised sampling implementation could be 
further improved by fine-tuning the algorithm param- 
eters. Our current implementation runs in userspace, 
which naturally limits the ability to exactly time probe 
packets. A kernel implementation, using network cards 
capable of high-precision traffic generation, or use of a 
real-time kernel, could achieve higher accuracy.


Abstract


In User-Based Network Services (UBNS), the process 
servicing requests from user U runs under U’s ID. This 
enables (operating system) access controls to tailor
service authorization to U. Like privilege separation, UBNS
partitions applications into processes in such a way that
each process’s permissions are minimized. However, because
UBNS fundamentally affects the structure of an application,
it is best performed early in the design process.

UBNS depends on other security mechanisms, most 
notably authentication and cryptographic protections. 
These seemingly straightforward needs add considerable 
complexity to application programming. To avoid this 
complexity, programmers regularly ignore security is- 
sues at the start of program construction. However, after 
the application is constructed, UBNS is difficult to ap- 
ply since it would require significant structural changes 
to the application code. 

This paper describes easy-to-use security mechanisms
that support UBNS and thus significantly reduce the
complexity of building UBNS applications. This
simplification enables much earlier (and hence more
effective) use of UBNS. It focuses the application developer’s
attention on the key security task in application develop- 
ment, partitioning applications so that least privilege can 
be effectively applied. It removes vulnerabilities due to 
poor application implementation or selection of security 
mechanisms. Finally, it enables significant control to be 
externally exerted on the application, increasing the abil- 
ity of system administrators to control, understand, and 
secure such services. 


1 Introduction 


Computer networking was designed in a different era, in
which computers were kept in locked rooms and communication
occurred over leased lines, isolating systems from external
attackers. At that time, physical security went a long way
toward ensuring adequate computer security. Today, however,
attackers can remotely target a computer system from
anywhere in the world over the Internet.

Given that physical separation is no longer an alterna- 
tive, securing networked applications requires isolation 
of a different form, including in general: 


1. authentication of both users and hosts; 


2. protection of communication confidentiality and in- 
tegrity; and 


3. authorization (also known as access controls) using 
least privilege [32]. 


The first two tasks are typically provided for within the 
application, for example by using SSL [12] or Kerberos 
[38]. The last task is ideally enforced by the Operating 
System (OS), since then failures in the application (e.g., 
a buffer overflow) do not bypass authorization. 

But (1) and (2) are complicated by Application Program
Interfaces (APIs) which are both difficult and tedious to
use; for example, in addition to the basic authentication
mechanism, it is necessary to communicate information from
client to server (perhaps using GSSAPI [24]) and to
interface with PAM [33] and the OS. The application
programmer must choose from a large variety of
authentication techniques (e.g., password or public-key),
and compensate for their weaknesses. Since complexity is
the enemy of security, it is especially important to avoid
complexity in security-critical code. Moreover,
authentication is often attacked, both directly (for
example, password dictionary attacks against SSH¹) and
through flaws in implementations of authentication².

Consider the dovecot IMAP server. Over 9,000 lines 
are devoted to (1) and (2), consuming 37% of the IMAP 
service code (see Section 6 for details). Clearly, this is a 
large burden on application developers, and as we shall 
show, unnecessary. In contrast, the partitioning of the 
application into processes, and their attendant privileges,
is a concern of application programmers since it is
fundamental to program structure. The impact of this
partitioning includes the number and purposes of processes,
the privileges associated with processes, the communication
between processes, the organization of data the processes
access, the data and operations which must be performed
within a process, the sequencing of operations, and the
security vulnerabilities. For a general discussion of
these issues, see [4].

One important way of partitioning network services is 
by the remote user U they serve. That is, a server process 
which receives requests from U runs under U’s user ID, 
so that its “ownership” is visible to, and limited by,
external authorization. Although this scheme is widely
used, we know of no established term for it, so we shall
call it User-Based Network Services (UBNS). UBNS is used,
for example, in dovecot, SSH, and qmail. It prevents a
user’s private data from being commingled with other users’
data and provides the basis for OS authorization. The
latter enables system administrators to configure
secure services easily. Given the many sources of service
code—and frequent releases of the services—it is highly 
desirable to move the security configuration and enforce- 
ment outside the service. This minimizes the harm that 
errant services can do, reduces the need to understand 
(often poorly documented) application security, enables 
strong protections independent of service code, is more 
resilient in the presence of security holes, and vastly in- 
creases the effectiveness of validating service security. 

Despite the advantages of UBNS, authentication is of- 
ten performed in a service-specific way or not at all. 
A prime example is the Apache web server (and most 
other web servers). In Apache, the users are not visible 
to the OS. The crucial independent check provided by 
OS-based authorization is lost. And application devel- 
opers often avoid service-specific authentication, due to 
the complexity it engenders. Hence, an application’s ini- 
tial design often forgoes security concerns which then 
must be retrofitted after the fact [13]. But retrofitting 
UBNS requires restructuring and re-implementing sub- 
stantial portions of the application. And since it is diffi- 
cult to restructure existing applications, the service may 
never be made into a UBNS. 

If UBNS were easier to implement at the application 
level, it could be integrated from the beginning of system 
design. Application complexity would be decreased and 
security would be improved. In this paper, we describe
how to radically reduce complexity in UBNS services using
netAuth, our network authentication and authorization
framework. In netAuth, a service requires only 4 lines of
code to implement authentication and 0 for encryption and
authorization. Hence, netAuth


1. allows authenticated services to be easily integrated
and


2. enables requests for the same user to be directed to
the same back-end server.


The first is essential to support UBNS. The second makes 
it easier to re-use per user processes, removing the need 
for concurrent programming while increasing system ef- 
ficiency. In addition, these mechanisms enable more 
modular construction of applications. 

We describe the NetAuth APIs and the implementations
they give rise to. By making these mechanisms almost
entirely transparent, an application developer adds only 
minimal code to use these mechanisms. We describe 
sufficient networking interfaces to support UBNS and 
describe their implementations. These mechanisms are 
quite simple and thus are easy to use. The protections 
provided are also considerably stronger than those in 
most applications. We then describe a port of a UBNS
service, dovecot, to netAuth, and the substantial code
savings, simplified process structure, and reduced
attack surface that result from this port.

The remainder of the paper is organized as follows:
Section 2 describes related work. Section 3 gives an
overview of our system. We then describe our system in
more depth: Section 4 describes how our authentication
mechanism can be used to write applications. Section 5
briefly describes our implementation and reports some
performance numbers. In Section 6, we describe the
experience of porting dovecot to netAuth. Section 7
discusses the security achieved, and finally we conclude.


2 Related work 


UBNS is not the only way to partition a service into
multiple processes. Another, complementary way is privilege
separation [29], in which an application is partitioned
into two processes, one privileged and one unprivileged.
For example, the listening part of the service, which
performs generic processing (initialization, waiting for
new connections, etc.), is often run as root (i.e., with
administrative privileges) because some actions need these
privileges (for example, to read the file containing hashed
passwords or to bind to a port). Unfortunately, exploiting
a security hole in a root-level process fully compromises
the computer. By splitting the server into two processes,
the exposure of a root-level process is minimized. In
contrast to UBNS, retrofitting privilege separation is not
difficult, and there exist both libraries [20] and compiler
techniques [6] to do it. Both UBNS and privilege separation
are design strategies to maximize the value of least
privilege [32].

SSH is a widely used UBNS service [42, 29], but it is
ill-suited to implementing other UBNS services (such as
mail, calendaring, source control systems, and remote file
systems) because of the way network services are built. In
the network case, the listening process exists before the
connection is made and must at connect time know what user
is associated with the service. SSH’s port forwarding³
performs user authentication at the service host, but not
at the service, and hence, to the service the users of a
host are undifferentiated⁴. As a result, traditional UBNS
services use authentication mechanisms such as SSL or
passwords and OS mechanisms such as setuid, which are
awkward to program and may not be secure. In contrast,
netAuth both authenticates and authorizes the user on a
per-service basis, so that the service runs only with the
permission of the user. Unlike SSH, netAuth provides
end-to-end security from client to service.

Distributed Firewalls [5] (based on Keynote [5]), in
contrast to SSH, implement per-user authorization for
services by adding these semantics to the connect and
accept APIs. While Distributed Firewalls sit in front
of the service, and thus are not integrated with the ser- 
vice, Virtual Private Services are integrated and thus can 
provide UBNS services [16]. In DisCFS [27], an inter- 
esting scheme is used to extend the set of users on the 
fly by adding their public keys; although we have not yet 
implemented it, we intend to use this mechanism to al- 
low anonymous access (assuming authorization allows it 
for a service) thus combining the best of authenticated 
and public services. 

Shamon [25, 17] is a distributed access control system 
which runs on Virtual Machines (VMs). It “knits” to- 
gether the access control specifications for different sys- 
tems, and ensures the integrity of the resulting system 
using TPM and attestation techniques. Its communica- 
tion, like netAuth, is implemented in IPsec and uses a 
modified xinetd to perform the authorization. Shamon 
implements a very comprehensive mechanism for autho- 
rization (targeted for very tightly integrated systems), in 
contrast to netAuth’s less complete but simpler service- 
by-service authorization. 

We do not describe the authorization part of netAuth 
in this paper for two reasons. First, there is not suf- 
ficient space. Second, the authentication mechanism 
can be used with any authorization model. For exam- 
ple, even POSIX authorization, privilege separation, and 
VMs could be combined to provide a reasonable base for 
UBNS. The most value for authorization is gained when 
privileges are based both on the executable and the user 
of the process, increasing the value of privilege
separation. Such separation is essential to allow multiple
privilege-separated services to run on the same OS.
Examples of such mechanisms include SELinux [34], AppArmor
[9], and KernelSec [30]. Janus [15], MAPBox [2], Ostia [14],
and systrace [28, 22] are examples of sandboxing mechanisms
which attenuate privileges.

SANE/Ethane [8, 7] has a novel method of authorizing
traffic in the network. An authorizing controller
intercepts traffic and, based on user authentication for
that host, determines whether to allow or deny the network
flow. This enables errant hosts or routers to be isolated.
However, the authentication information available to Ethane
using traditional OS mechanisms is coarse-grained (it
cannot distinguish individual users or applications).
Ethane and netAuth are complementary approaches, which
could be combined to provide network-based authorization
with fine-grained authentication.

Distributed authentication consists of two components:
a mechanism to authenticate the remote user and a means
to change the ownership of a process. Traditionally,
UNIX performs user authentication in a (user space)
process and then sets the User ID by calling setuid.
The process calling setuid needs to run as the superuser
(administrative mode in Windows) [39]. To reduce the
dangers of exploits using such highly privileged
processes, Compartmented Mode Workstations divided root
privileges into about 30 separate capabilities [3],
including a SETUID capability. These capabilities were also
adopted by the POSIX 1e draft standard [1], which was
widely implemented, including in Linux.

To limit the setuid privileges further, Plan9 uses an
even finer-grained one-time-use capability [10], which
allows a process owned by U1 to change its owner to U2.
NetAuth takes a further step in narrowing this privilege,
since it is limited to a particular connection and is
non-transferable; but a more important effect is that it is
statically declared and thus enhances information assurance,
whether manually or automatically performed.

The traditional mechanism to provide user authenti- 
cation in distributed systems is passwords. Such pass- 
words are subject to dictionary and other types of at- 
tacks, and are regularly compromised. Even mechanisms 
like SSL typically use password based authentication for 
users [12] even though they can support public key en- 
cryption. 

Kerberos [38] performs encryption using private key
cryptography. Kerberos has a single point of failure if
the KDC is compromised; the use of private keys also means
that there is no non-repudiation to prove that the user
did authenticate against a server; and it requires that
the KDC be trusted by both parties. Microsoft Windows’
primary authentication mechanism is Kerberos.

Plan9 uses a separate (privilege-separated) process 
called factotum, to hold authentication information 
and verify authentication. The factotum process asso- 
ciated with the server is required to create the change-of- 
owner capability. But factotum is invoked by the ser- 
vice, and hence can be bypassed allowing unauthenti- 
cated users to access the service. Of course, it is in prin- 
ciple possible to examine the source code for the service 
to determine whether authentication is bypassed, but this 
is an error-prone process and must be done anew each
time an application is modified. NetAuth enforces
authentication and authorization which cannot be bypassed
and is easier to analyze.

The OKWS web server [21], built on top of the As- 
bestos OS [11] does a per user demultiplex, so that each 
web server process is owned by a single user. This in 
turn is based on HTTP-based connections, in which there 
can be multiple connections per user, tied together via 
cookies. It uses the web-specific mechanism for sharing 
authentication across multiple connections. OKWS was 
an inspiration for netAuth, which allows multiple con- 
nections from a user to go to the same server. NetAuth 
works by unambiguously naming the connection so that 
it works with any TCP/IP connection; and hence is much 
broader than web-based techniques. 


3 System overview 


NetAuth is modular, so that different implementations
and algorithms can be used for each of the following
three components:


1. User authentication is triggered by new network 
APIs which (a) transparently perform cryptographic 
(public key) authentication over the network and 
(b) provide OS-based ownership of processes. Part 
(b) inherently requires an authorization mechanism 
which controls the conditions under which the user 
of a process can be changed. 


2. Encrypted communication between authenti- 
cated hosts ensures that confidentiality and in- 
tegrity of communications are maintained, and also 
performs host authentication. This encryption is 
provided by the system and requires no application 
code. 


3. Authorization is used to determine if a process can 
(a) change ownership, (b) authenticate as a client, 
(c) perform network operations to a given address, 
and (d) access files (and other OS objects). In 
UBNS this ideally depends on both the service and 
the user. Thus, the authentication mechanism es- 
sentially labels server processes with the user on 
whose behalf the service is being performed so that 
external authorization can be done effectively. It is 
highly desirable that the authorization system pre- 
vent attacks on one service spilling over to other 
services. 


Due to space limitations, this paper focuses on user
authentication. Authentication may seem trivial, but it
requires a significant amount of code in applications, so
much so that this mechanism is justified solely by its
improvement to authentication (without also improving
authorization). Our server implementation is in the Linux
kernel, but our client is user-space code which can be
ported to any OS, including proprietary ones.

For encryption we require that hosts be authenticated 
and that cryptographic protections be set up transparently 
between hosts. Host authentication is important since if 
the end computer is owned by an attacker then security 
is lost. Such end devices can be highly portable devices 
such as cellphones. (For less important applications one
can use untrusted hosts.) Encryption can be triggered
either in the network stack or by a standalone process.
Currently, we are using IPsec [19, 18] for this purpose, as
recent standards for IPsec have made it significantly more
attractive as it allows for one of the hosts to be NATed 
[40]. But we expect to replace it with a new suite being 
developed which will be far less complex and faster. 

The netAuth API can be used with any authorization 
model, which would need to control both change of own- 
ership and client authentication, perhaps using simple 
configuration files [20] as well as networking and file 
systems to some extent. NetAuth’s authorization model
controls who may bind, accept, and connect to remote
services on a per-user basis, and provides fine-grained
control over the users and services which can access a file.
NetAuth’s authorization model is fully implemented, and 
we will describe it in a forthcoming paper. 

A central tenet of our design is a clear separation be- 
tween administration and use of our system. Even when 
the same person is performing both roles, this separation
enables allowed actions to be determined in advance,
instead of the user being interrupted in mid-task with
authorization questions (e.g., “do you accept this
certificate?”).
It also supports a model of dedicated system adminis- 
trators; further partitioning of the system administration 
task is possible, for example to allow outsourcing of parts 
of the policy. 

In netAuth, user processes never have access to cryp- 
tographic keys and cryptographic keys can only be used 
in authorized ways. Hence, from the authorization con- 
figuration the system administrator can easily determine 
which users are allowed to use a service and how services 
can interact with each other. 

NetAuth enables successive connections by the same 
user to be directed to a single process dedicated to that 
user. We shall see that this has both programming and 
efficiency advantages. In addition to its uses in tradi- 
tional network services, it can be used to easily set up 
back ends on the same system, and thus allow for further 
opportunities for UBNS. 

We next give an overview of network authentication 
and UBNS mechanisms in netAuth. 


Network authentication. NetAuth enables the owner of
a process to be changed upon successful network
authentication. Authentication is implemented as follows:


* the server system administrator must enable UBNS
change-of-ownership by specifying the netAuthenticate
privilege for the service.


* the client process requests the OS to create a
connection and a time-limited, connection-specific,
digitally signed authenticator⁵ [31].


* the server process explicitly requests the OS to
perform network authentication. The user authentication
is only usable by the designated server process
(it is non-transferable).


This mechanism requires that the client-side system ad- 
ministrator enable the client to use netAuth authentica- 
tion, and the server-side administrator provide the ne- 
tAuthenticate privilege. As we shall see, application 
code changes to support authentication are trivial on both 
client and server sides. 

Because public key signatures are used for authenti- 
cation, the log containing these signed exchanges proves 
that the client requested user authentication. This prop- 
erty both helps to debug the mechanism and to ensure 
that even the server administrator cannot fake a user au- 
thentication. Lastly, since no passwords are used over the 
network, this scheme is impervious to password guessing 
attacks. 
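
The properties of the authenticator described above (time-limited, bound to one connection, and verifiable by the server) can be illustrated with a small Python sketch. This is not netAuth's wire format: the `Authenticator` fields, the `sign` and `verify` helpers, and the sample addresses are all hypothetical. The sketch also uses an HMAC from the standard library as a stand-in for the public-key signature that netAuth actually uses, so it does not provide the non-repudiation property discussed in the text.

```python
import hashlib
import hmac
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Authenticator:
    """Hypothetical authenticator: names one user on one connection until expiry."""
    user: str
    src: tuple       # (client IP, client port)
    dst: tuple       # (server IP, server port)
    expires: float   # absolute expiry time, in seconds since the epoch

    def encode(self) -> bytes:
        return repr((self.user, self.src, self.dst, self.expires)).encode()


def sign(auth, key):
    # Stand-in only: netAuth signs with a private key; this is a shared-key MAC.
    return hmac.new(key, auth.encode(), hashlib.sha256).digest()


def verify(auth, tag, key, src, dst, now=None):
    """Accept only an untampered, unexpired authenticator for this exact connection."""
    now = time.time() if now is None else now
    return (
        hmac.compare_digest(tag, sign(auth, key))
        and (auth.src, auth.dst) == (src, dst)
        and now < auth.expires
    )
```

In this sketch an authenticator replayed on a different connection, or presented after its expiry time, fails `verify` even though its tag is valid.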


UBNS. NetAuth has a built-in mechanism to support
UBNS. All connections to a specified service from user
U_i can be served by a single server process p_i unique
to that user. For users U_i for which there does not
exist a corresponding process p_i, a listening process p
pre-accepts (see Section 4) the connection and creates a
new process p_i.⁶ Figure 1(a) shows two types of queues of
unaccepted connections maintained by netAuth (one for
new users and the other for users for which there exists a
user process).

Per-user server processes are created on demand for
efficiency and flexibility. Successive connections for U_i
will reuse server process p_i. NetAuth can also support
other commonly used methods such as pre-forking processes
or forking a process per connection.

This mechanism provides a very clean programming 
model as it is trivial to create back-end services for each 
user on demand. For example, Figure 1(b) shows a calendar
proxy which caches a user’s local and remote calendars
(and no one else’s) and provides feeds to a desk planner,
an email-to-calendar appointment program, a reminder
system, etc. The reminder mechanism might know where the
user is currently located and where the appointment is, so
that reminders can be given with suitable lead times. As
the user’s connections are always to the same process,
requests are serialized for that user, preventing race
conditions (and the need to synchronize) and making it easy
to add calendar applications without configuring them for
security (since the configuration is in the proxy). Such a
model also allows different parts of the application to
execute on different systems. For example, a user interface
component could run on a notebook, and a backend store
could run on an always-available server. We next look at
the uses of NetAuth in more detail.


4 NetAuth Application Programming Interface


There are several ways to set the owner of a network ser- 
vice: (1) the service can be configured to run as a pseudo 
user (e.g., apache) with enough privileges to satisfy any 
request. (2) the service may need user authentication to 
ensure that it is a valid user (e.g., for mail relay), but 
all users are treated identically. This service too can be 
owned by a pseudo user. (3) the service provided depends
on the user, who therefore must be authenticated; it is
usually appropriate that the service process be owned
by its user (i.e., UBNS).

A UBNS service (a process run under the user’s ID) 
performs the following steps: (a) it accepts a connection, 
(b) performs user authentication to identify the user re- 
questing the service, (c) creates a new process, and (d) 
changes the ownership of the process to the authenticated 
user. Once the ownership of the process is changed to the 
user, it cannot be used by anyone else. 

We next examine how this general paradigm is per- 
formed in Unix and then in netAuth. 

Figure 2(a) shows the call sequence for implementing 
a user authenticated service using UNIX socket APIs. 
The client creates a socket (socket), connects to the 
server (connect), does a series of sends and
receives (send/recv), and, when it is done, closes the
socket.

The server creates a socket (socket), associates it 
with a network address on the server (bind), allocates a 
pending queue of connection requests (listen), and waits
for a new connection request to arrive (accept). To perform
UBNS, it spawns a process (fork) and, after determining
the user via network messages (not shown), changes the
owner of the process (setuid). At this
point the newly created service process is operating as 
the user. It communicates back and forth with the client 
and then closes the connection. Since there is typically 
no way to reuse the process after it closes the socket, it 
exits. 
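
The traditional server-side sequence just described can be sketched with Python's standard `socket` and `os` modules, which wrap the same UNIX calls. This is only an illustration under stated assumptions: `serve_one` and the `authenticate` callback are hypothetical names, the callback stands in for the application-level authentication messages that the text leaves out, and a one-message echo stands in for the real service protocol. The caller is expected to have already done socket, bind, and listen.

```python
import os
import socket


def serve_one(listener, authenticate):
    """Accept one connection and serve it in a child owned by the authenticated user.

    `authenticate(conn)` is a hypothetical callback that returns the numeric
    user ID established by application-level messages on the connection.
    """
    conn, _addr = listener.accept()       # accept: wait for a connection request
    pid = os.fork()                       # fork: spawn the per-connection process
    if pid == 0:                          # child process
        status = 1
        try:
            listener.close()
            uid = authenticate(conn)      # determine the user (placeholder)
            os.setuid(uid)                # setuid: needs root unless uid is already ours
            data = conn.recv(1024)        # recv: read one request
            conn.sendall(data)            # send: echo it back as the "service"
            status = 0
        finally:
            conn.close()
            os._exit(status)              # the process is not reused after close
    conn.close()                          # parent keeps only the listening socket
    return pid
```

The parent would normally call `serve_one` in a loop and reap children with `os.waitpid`. Note that everything before `os.setuid` runs with the listener's privileges, which is the exposure that the text goes on to address.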

Figure 2(b) shows the equivalent sequencing for ne- 
tAuth. On the client side, the only programming change 
needed to adapt to netAuth is to replace connect with
connect_by_user (of course, the application-level
authentication must be removed).

Figure 1: Privilege separation in netAuth

Figure 2: Sequence of system calls executed by a client and a server. The server forks a process to service a request;
the forked process is owned by the authenticated user.

On the server side, netAuth basically splits the accept
for a new connection into two phases:


* The first phase is called the pre_accept, which
determines when a new user (one that does not have
a service process) arrives. Hence, the pre_accept
blocks until there is a waiting connection for some
user U without a corresponding service process
owned by U. (To prevent race conditions, a process
which has done a pre_accept for U but has not yet
changed its owner holds a temporary reservation
for U and is treated as a service process for U.)


* The second phase is the accept_by_user, which
actually accepts the connection, after a process
owned by the new user has been created.


The accept is split into two APIs because there are now
two actions: (1) determining that there is an unaccepted
connection for a new user (so that a new process can
be created) and (2) completing the accept by a (child)
process owned by the new user. (2) ensures that the
accepted socket can be read or written (since the process
is owned by the user). Hence, the split accept ensures that
the accept_by_user only succeeds if the owner of the
process is the authenticated user on the connection.

The change of ownership of the process is performed 
by set_net_user. The set_net_user changes the 
owner of the process to the authenticated user and con- 
sumes the netAuthenticate privilege for that pro- 
cess. Thus, set_net_user serves as a highly restricted 
version of setuid, and is far safer to use. 
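
The interplay of pre_accept, set_net_user, and accept_by_user can be illustrated with a small in-memory simulation. It is a sketch of the semantics described above, not the kernel implementation: the `Process` and `NetAuthSim` classes are hypothetical, connections are plain `(user, payload)` pairs, and calls that would block in the real system return `None` here.

```python
from collections import deque


class Process:
    """A simulated process: an owner plus the netAuthenticate privilege."""

    def __init__(self, owner, net_authenticate=False):
        self.owner = owner
        self.net_authenticate = net_authenticate
        self.reserved_for = None


class NetAuthSim:
    """In-memory stand-in for the queues of unaccepted connections."""

    def __init__(self):
        self.pending = deque()  # (user, payload) pairs in arrival order
        self.claimed = set()    # users with a service process or a reservation

    def connect_by_user(self, user, payload):
        self.pending.append((user, payload))

    def pre_accept(self, proc):
        """Reserve proc for the first waiting user without a service process."""
        for user, _ in self.pending:
            if user not in self.claimed:
                self.claimed.add(user)
                proc.reserved_for = user
                return user
        return None  # the real call would block here

    def set_net_user(self, proc):
        """Change proc's owner to the reserved user, consuming the privilege."""
        if not proc.net_authenticate or proc.reserved_for is None:
            raise PermissionError("netAuthenticate privilege required")
        proc.owner = proc.reserved_for
        proc.net_authenticate = False

    def accept_by_user(self, proc):
        """Accept the oldest pending connection whose user owns proc."""
        for i, (user, payload) in enumerate(self.pending):
            if user == proc.owner:
                del self.pending[i]
                return payload
        return None  # the real call would block here


def demo():
    sim = NetAuthSim()
    for user, payload in [("alice", "a1"), ("bob", "b1"), ("alice", "a2")]:
        sim.connect_by_user(user, payload)
    served = {}
    while True:
        child = Process("root", net_authenticate=True)  # stands in for fork
        user = sim.pre_accept(child)
        if user is None:
            break
        sim.set_net_user(child)
        served[user] = [sim.accept_by_user(child), sim.accept_by_user(child)]
    return served
```

In `demo`, both of alice's connections are delivered to the single process created for her, while bob's process sees only his own connection; a second `set_net_user` on the same process raises `PermissionError` because the privilege has been consumed.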


5 Implementation 


In this section, we describe the netAuth architecture, the 
protocol for user authentication, and the implementation. 
We then describe some performance numbers. 


5.1 Architecture 


The design of netAuth emphasizes the separation of au- 
thentication, authorization, and cryptographic mecha- 
nisms away from the application. 

The overall architecture is shown in Figure 3. Appli- 
cations communicate with each other using APIs which 
emphasize process authentication—the one component 
of netAuth which must be visible to networked applica- 
tion code. There are two types of communications, both 
of which flow over an IPsec tunnel between the hosts: 


* the application’s protocol (or data, for performing
its function) and


* the netAuth authentication information.

The authentication information is managed by two netAuth
daemons, netAuthClient and netAuthServer, which both
perform the public key operations for user authentication
and enable the process’s change of ownership.


5.2 Authentication protocol 


Because IPsec is used for communication, IPsec per- 
forms host authentication. This means that the remote 
service is authenticated, because the service type is deter- 
mined by port and the IP is verified using IPsec’s public 
key host authentication. 

Before application communication is established, user 
authentication is performed: 


netAuthClient signs an authenticator which describes 
the connection. 


netAuthServer receives the authenticator and verifies 
its signature. 


Public-key cryptographic operations can be considerably 
more expensive than symmetric key algorithms. For- 
tunately, signing (which is done on the relatively idle 
client) takes significantly longer than verifying (on a 
busy server). For example, RSA public key signing times
(client) and verification times (server) for 1024- and
2048-bit keys are shown in Table 1⁷.

Once the netAuthClient has proved that it can 
sign the authenticator, successive signings prove little 
(since from the first signing we know that the netAu- 
thClient has the requisite private key). Hence, succes- 
sive connects for that user employ a quick authentication 
based on hash chains [23]. 
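
The text does not spell out how the hash chain is used, so the following is a minimal sketch of a Lamport-style hash chain, assuming that the chain's final value (the anchor) is conveyed to the server under the first, signed authentication. The names `make_chain` and `ChainVerifier`, and the choice of SHA-256, are assumptions made for illustration.

```python
import hashlib
import hmac


def h(value: bytes) -> bytes:
    return hashlib.sha256(value).digest()


def make_chain(seed: bytes, length: int) -> list:
    """Return [seed, h(seed), h(h(seed)), ...]; the last element is the anchor."""
    chain = [seed]
    for _ in range(length):
        chain.append(h(chain[-1]))
    return chain


class ChainVerifier:
    """Server side: remembers the most recently accepted chain value."""

    def __init__(self, anchor: bytes):
        self.last = anchor

    def verify(self, token: bytes) -> bool:
        # A token is valid only if it hashes to the previously accepted value.
        if hmac.compare_digest(h(token), self.last):
            self.last = token
            return True
        return False
```

The client reveals the chain values in reverse order, one per connect or re-authentication. Each check costs the server one hash instead of a signature verification, and a replayed token is rejected because the verifier has already moved past it.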

We use a separate connection to send our authen- 
ticator, rather than the more traditional mechanism of 
piggybacking authentication on the application connec- 
tion. This is done both to increase the flexibility of 
communications and to allow connections to be re- 
authenticated periodically. Re-authentication determines 
whether the user’s account is still active, and hence a 
re-authentication failure disables the user’s account and 
stops their processes, something that is difficult to do 
with other protocols. We re-authenticate using the same 
hash chain scheme as for successive connects for the 
same user. 














key size (bits) | signing | verifying
1024 | 680 µs | 40 µs
2048 | 2,780 µs | 80 µs

Table 1: RSA signing/verification times in µseconds

Figure 3: Architecture of netAuth


5.3 Kernel-based implementation


The first NetAuth implementation has been integrated 
into the Linux kernel. Our implementation has three key 
components: 


* kernel extensions (code integrated into the mainline
kernel) implementing networking support for
processes with per-user privileges and providing the
new system calls pre_accept, set_net_user,
and accept_by_user.


* a loadable kernel module implementing netAuth
authorizations, which uses the Linux Security Module
(LSM) framework [41].

The LSM framework segregates the placement of
hooks (scattered through the Linux kernel) from the
enforcement of access controls (centralized in an
LSM module). Thus changes in the mainline kernel
(mostly) do not affect LSM modules.


* three user-space daemons which (1) download
the networking policy into the kernel using the
netlink facility, (2) sign authenticators, and (3)
verify authenticators.


The kernel implementation currently consists of about 
3,700 lines of C code (~3,000 in the kernel module and 
~700 in the kernel extensions). 


5.4 Performance 


We now report on NetAuth’s performance. All the experiments
were run using a server, an AMD 4200+ (2.2 GHz) machine
with 2GB RAM, and a client, an AMD 4600+ (2.2 GHz) machine
with 1GB RAM. Both computers ran Linux kernel v2.6.17, used
gigabit networking, and were connected by a crossover cable
(note 8). We measured elapsed times (from the applications)
in all cases.

We performed two types of performance tests to
measure latency. First, we measured the overhead
of netAuth authorization and compared it to unmodified
Linux, for the cases of the bind, connect and
connect-send-recv operations. Second, we measured
latency for netAuth’s per-user services. For the
second part, there is no comparable Linux scenario and
hence we report absolute times there.
Operation                      | UNIX (µs) | NetAuth (µs) | Overhead
bind                           | 6.00      | 6.75         | 12.5%
connect                        | 28.00     | 32.00        | 14.28%
connect-send-recv (Unix style) | 145.00    | 157.00       | 8.27%


Table 2: Elapsed times for the micro-benchmarks and the
Unix-style concurrent server (see Section 3). No authentication
is performed in any of these cases. All times are in
microseconds.


5.4.1 System call overhead (no authentication) 


Our first measurements determine the authorization overheads
for netAuth, by using a lightweight authentication
with minimal overhead. The authorization mechanism
limits which users can use the service; it is implemented
outside of the application.

The measurements, given in Table 2, compare netAuth
with unmodified Linux:


* the time to perform a bind by a server increased
by 12.5% due to the overhead of doing the authorization
checks.


* the time to complete a connect (as measured) on
the client side increased by 14.28%, due to client-side
and server-side authorization checks. The
elapsed time includes a round-trip packet time.


* the time to do a connect-send-recv
(as measured) on the client is considered
next. (The server must do an
accept-fork-setuid-recv-send.) The send
and recv are 128 bytes of data. For the UNIX case,
the total time was about 145 µs while for the
NetAuth case the time was about 157 µs, an
overhead of 8.27%. The most costly operation is
the fork performed at the server to create a new
per-user process.


We note that these overheads are best case [26]; normally,
latencies are higher. Moreover, no performance tuning
has yet been done on the netAuth implementation.
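
As a worked check of the percentages quoted above, the relative overhead is (NetAuth time - UNIX time) / UNIX time. The short sketch below recomputes it from the Table 2 timings; the function and variable names are ours, not the paper's. It suggests that the reported 14.28% and 8.27% are truncated rather than rounded to two decimal places.

```python
def overhead_percent(baseline_us: float, netauth_us: float) -> float:
    """Relative overhead of netAuth over the unmodified baseline, in percent."""
    return (netauth_us - baseline_us) / baseline_us * 100.0


# (UNIX time, NetAuth time) in microseconds, copied from Table 2.
measurements = {
    "bind": (6.00, 6.75),
    "connect": (28.00, 32.00),
    "connect-send-recv": (145.00, 157.00),
}

overheads = {op: overhead_percent(*pair) for op, pair in measurements.items()}

for op, pct in overheads.items():
    print(f"{op}: {pct:.4f}%")
```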


5.4.2 Using netAuth authentication


This section describes the case of a server in which the
process for user U_i satisfies all of the requests from U_i
for a particular service (as described in Section 5). There
does not exist a comparable scenario in UNIX. Hence,
we report the latencies observed on the client side in
Table 3.

















Connection | netAuth (with auth.) | Linux (w/o auth.)
first      | 4200 µs              | 147 µs
successive | 67 µs                | 147 µs


Table 3: Elapsed times observed on the client side to perform
a connect-send-recv. The netAuth connections
are established with user authentication. Successive
netAuth connections are to the same per-user server process
created by the first connection on the server. The
UNIX connections are established without user authentication.


For the first connection, using netAuth authentication
mechanisms, a new connection results in the following
set of actions: (1) on the client, the kernel requests an
authenticator from the user-space daemon; (2) the client
generates the authenticator and sends it to the server
where it is verified; (3) there is an RTT for sending the
authenticator to the server and receiving the response from
the server; (4) there may be context-switch times (between
the client process and the authentication daemon); and (5)
there may be scheduling delays. The costliest operation by
far is the cryptographic signing of the authenticator.

All subsequent connections on behalf of the same user
run much faster because they re-use the same server process
and fast authentications. In comparison, the elapsed
time for the UNIX case is the same for all cases because
there are no schemes for a client to re-use a previously
created per-user process. The values for UNIX shown in
Table 3 are without authentication overhead.


5.4.3 Server throughput 


We next consider server throughput in terms of new connections.
In netAuth, although the first authentication
must be signed, successive authentications require only a
very fast cryptographic hash. From Table 1, the service-based
verification of signatures takes only 80 µs.
Hence, a single core can perform authentication for
45,000,000 users per hour (3,600 s / 80 µs), assuming
authentications are cached for one hour. We believe that
such performance levels eliminate the need to consider
weaker authentication mechanisms, even for very high
volume services.
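
The throughput figure follows directly from the 2048-bit verification time in Table 1. The arithmetic can be reproduced as below; the variable names are illustrative. The sketch assumes, as the text does, that each user needs one signature verification per hour and that the core does nothing else.

```python
VERIFY_US = 80                      # 2048-bit RSA verification time (Table 1)
US_PER_HOUR = 3600 * 1_000_000      # microseconds in one hour

# One verification per user per hour, back to back on a single core.
verifications_per_hour = US_PER_HOUR // VERIFY_US
print(f"{verifications_per_hour:,} verifications per core-hour")
```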


5.5 Alternative implementations 


Our first implementation, which is described here, is a
kernel-level implementation. Of course, we would like
the APIs described here to be available on other systems
without kernel modifications, particularly for those OSes
for which the source code is not available.

We consider here only the client-side issues, as we
would like netAuth services to be usable from any operating
system. (The server OS, on the other hand, is under
the control of the service provider.) In Section 6.2 we
describe a proxy implementation which uses netAuth,
but which could easily be extended into one which implements
netAuth at the protocol level from user space
rather than using the netAuth APIs.


6 Porting applications to netAuth 


To show the effectiveness of netAuth we ported a UBNS 
service. We have not yet attempted to port a service 
which is not UBNS organized (such as Apache), as that 
is a far more difficult problem. We chose an applica- 
tion, dovecot, which supports both privilege separation 
and UBNS. 























Process name | executable name | user ID
master       | dovecot         | root
auth         | dovecot-auth    | root
login        | imap-login      | dovecot
             | pop3-login      | dovecot
imap         | imap            | U


Table 4: Dovecot processes and their respective user
IDs. Here U refers to the user ID of the (remote) user
who is accessing her mail.


Dovecot is an open source IMAP and POP mail server 
(and is included in Linux distributions such as Debian 
and Ubuntu). Users can access dovecot-based services 
remotely using a Mail Viewer Agent (MVA) such as Thun- 
derbird or Outlook. The MVA on the client communicates 
with dovecot using the IMAP or POP protocols over SSL 
or unencrypted connections. 

Dovecot was built with security as a primary goal.
Since January 2006, its developer has offered an as-yet-uncollected
reward of 1000€ for the first provable security
hole (note 9). To support both privilege separation and
UBNS, dovecot has four process types, running under
root, the dovecot pseudo user, and the user U retrieving her
mail, as shown in Table 4.

Table 5 shows the code organization of the dovecot
distribution supporting IMAP (v1.0.9) (note 10). Dovecot also
uses pam, crypto, and ssl libraries which are not included
in these line counts. The source distribution to support
IMAP is 24,628 lines of code, of which 9,307 (note 11) (37.8%)
are associated with authentication and encryption. The
port consisted of removing this code, and copying over
fewer than 1,000 lines from master (configuration and
the concurrent server loop) and login (the initial handshake
code) to imap.

The port reduces the number of process types from
four to one. With a traditional Unix authorization model,
the port still requires root to bind to port 143 and to do
setuid; but unlike the pre-port version, our imap process
never reads user input while running as root and thus is
not exposed, as root, to buffer overflow attacks. (The
privileges can be reduced still further using netAuth’s
authorization model.)

When implementing an IMAP service from scratch,
only 4 netAuth-specific lines would be needed to provide
authentication and encryption, beyond what is required for
an unauthenticated service.


6.1 Dovecot before and after 


The standard version of dovecot is more complex be- 
cause of the privilege separation mechanism and espe- 
cially the complexity of using standard authentication 
and cryptographic mechanisms. We describe first the 
processes and then later the operations needed to retrieve 
IMAP mail in standard dovecot. 


6.1.1 Standard dovecot 


The dovecot distribution is composed of the following 
processes: 


master process starts the auth process and n (by default,
3) login processes. The master process is
also responsible for the creation of an imap process
after a successful authentication.


auth process authenticates new users for the login 
process (over a UNIX socket). The auth pro- 
cess also verifies successful authentications to the 
master before it creates a mail process. 


login process listens on the appropriate port (e.g., 143
for IMAP) for new connections. Once a connection
is established it negotiates with the MVA process
to initialize the connection (sending server capabilities,
setting up SSL, etc.) and requests authentication
of the user. Upon successful authentication,
the login process requests the master process to
create a new imap process and then exits.


imap process receives the socket descriptor over a 
UNIX socket from the login process. The imap 
process then communicates with the remote MVA to 
access the user’s mailbox on the server. 


Figure 4 shows the sequence of events that are nec- 
essary to create a new imap process to service requests 
from the MVA. 
1. Messages 1a and 1b establish the initial connection
between the MVA and dovecot. During this step,
the MVA requests and receives the server capabilities
(not shown in the figure).


2. Authentication step (shown as messages 2a-2e and
action 2f). (a) The MVA sends the user’s authentication
information as part of a LOGIN message. (b)
In response, the login process requests the auth
process to authenticate the user. (c) The login
process requests the auth process to authenticate.
(d) On successful authentication the login process
sends a response back to the MVA and (e) requests
the master process to create a new imap process.
(f) The master, after verifying a successful
authentication with the auth process, creates the
new mail process running on behalf of user U_i.


3. The imap process then services the MVA’s future 
requests. 


6.1.2 Porting dovecot to netAuth 


The porting of dovecot to netAuth consists of (a) remov- 
ing code, (b) moving some code into the imap process 
and (c) removing three of the four processes. A dove- 
cot process ported to netAuth is not expected to per- 
form the following functions: message encryption us- 
ing OpenSSL, GNU-TLS or the like; user authentication; 
performing the complex setuid() operation and re- 
lated code to ensure that the process does not have any 
privileged left-overs (in the form of file descriptors) in 
the unprivileged process. Hence, code for these secu- 
rity sensitive operations need not be implemented by the 
dovecot executable and can be removed. Thus, summa- 
rizing the dovecot port to netAuth: 


* the auth process (and its code) is eliminated completely,
as the user authentication is performed by
the OS as part of connection establishment.


* from the master process, only the code to bind to
the privileged port and to configure a new mail process
with the appropriate set of environment variables
is retained. (Dovecot passes configuration information
to the imap process as environment variables.)


* from the login process, only the initialization of a
new connection (1a and 1b in Figure 4) is retained.


* the core functionality of accessing and maintaining
mailboxes in the imap process is retained.


Thus, the dovecot port to netAuth runs as a single process
type (following the design for a concurrent server
implementation shown in Figure 2). The master, auth
and login processes are eliminated after taking a small
amount of code from them.


directory    | total lines of code
master       | 2,460
auth         | 5,469
imap-login   | 484
imap         | 3,456
lib-auth     | 490
lib          | 6,268
login-common | 1,138
lib-imap     | 1,069
lib-settings | 101
lib-ntlm     | 304
lib-sql      | 882
lib-dict     | 470
lib-storage  | 574
lib-mail     | 1,463
total        | 24,628


process | dovecot’s libraries used                       | total lines of code | dynamic libraries
master  | lib                                            | 8,728               |
auth    | lib, lib-settings, lib-ntlm, lib-sql           | 13,024              | pam, crypto
login   | lib, login-common, lib-auth, lib-imap          | 9,449               | ssl, crypt
imap    | lib, lib-dict, lib-mail, lib-imap, lib-storage | 13,300              | ssl


Table 5: Lines of code in the various directories in dovecot. The command ‘cat *.c *.h | grep
".™ | wc -l’ was used to determine this count.


[Figure 4 diagram: Client and Server sides; legible labels: 2a: LOGIN U_i; fork, setuid; 3: access mailbox]

Figure 4: The processes that comprise standard dovecot and their interaction to authenticate a user. Solid arrows
indicate message exchange while dashed ones represent process actions. Message exchanges across system boundaries
use a network socket while those within the same system use UNIX sockets.


[Figure 5 diagram; legible label: 2: access mailbox]

Figure 5: The message exchanges between the ported netAuth dovecot and the MVA.
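
The line counts in Table 5 can be cross-checked with a few sums. The sketch below assumes that each per-process total is the process's own source directory plus the dovecot libraries listed for it (with the login process built from imap-login); that reading is ours, but it reproduces every total in the table, as well as the 37.8% share quoted in Section 6.

```python
# Lines of code per directory, copied from Table 5.
dir_loc = {
    "master": 2460, "auth": 5469, "imap-login": 484, "imap": 3456,
    "lib-auth": 490, "lib": 6268, "login-common": 1138, "lib-imap": 1069,
    "lib-settings": 101, "lib-ntlm": 304, "lib-sql": 882, "lib-dict": 470,
    "lib-storage": 574, "lib-mail": 1463,
}

# Assumed composition: own directory first, then the libraries from Table 5.
process_dirs = {
    "master": ["master", "lib"],
    "auth": ["auth", "lib", "lib-settings", "lib-ntlm", "lib-sql"],
    "login": ["imap-login", "lib", "login-common", "lib-auth", "lib-imap"],
    "imap": ["imap", "lib", "lib-dict", "lib-mail", "lib-imap", "lib-storage"],
}

grand_total = sum(dir_loc.values())
process_totals = {p: sum(dir_loc[d] for d in dirs) for p, dirs in process_dirs.items()}
auth_crypto_share = round(9307 / grand_total * 100, 1)  # 9,307 lines from Section 6

print(grand_total, process_totals, auth_crypto_share)
```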

The resulting imap code performs the following 
steps: 


* initializes a socket to listen for new connections. It
performs a bind on the privileged port, a listen,
sets the accept mode to acceptByUser and blocks on
pre_accept waiting for a connection from a user
for which there is no imap process.


* when a connection from a new user arrives, the process
returns from pre_accept with the new user’s
information. The process forks a child process to
handle the user and returns to waiting for a new user.


* the child process changes the user by executing
set_net_user with the user information from the
pre_accept call. The child process runs as the
new user. This process can now accept the connections
(for that user) and process the MVA’s requests.


The user is authenticated as part of the processing
in the network stack to accept a connection. Hence,
pre_accept returns only for authenticated users. Connection
requests from users that fail to authenticate
successfully are dropped (with a RST sent back).
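
The control flow of the ported server can be modelled in a few lines. The sketch below is a pure user-space simulation, not the real API: pre_accept and set_net_user are system calls added by the netAuth kernel extensions and do not exist on a stock kernel, so SimulatedStack, its fields, and the request strings are hypothetical stand-ins. It only illustrates the dispatch rule described above: unauthenticated attempts never reach the application, connections for a user with an existing process go to that process, and pre_accept returns only for a new authenticated user.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class SimulatedStack:
    """Stand-in for the kernel: queue of (user, authenticated, request) attempts."""
    pending: deque = field(default_factory=deque)
    workers: dict = field(default_factory=dict)   # user -> requests served
    dropped: list = field(default_factory=list)   # users whose attempts were reset

    def pre_accept(self):
        """Return (user, request) for the next authenticated user without a worker."""
        while self.pending:
            user, authenticated, request = self.pending.popleft()
            if not authenticated:
                self.dropped.append(user)            # real stack: connection reset
            elif user in self.workers:
                self.workers[user].append(request)   # existing per-user process
            else:
                return user, request
        return None


def serve(stack: SimulatedStack) -> None:
    """Main loop of the single remaining process type."""
    while (new := stack.pre_accept()) is not None:
        user, request = new
        # Real code would fork() here; the child would call set_net_user
        # and then accept that user's connections.
        stack.workers[user] = [request]


stack = SimulatedStack(pending=deque([
    ("alice", True, "SELECT"), ("mallory", False, "SELECT"),
    ("alice", True, "FETCH"), ("bob", True, "SELECT"),
]))
serve(stack)
```

After the run, alice and bob each have one worker, alice's second connection has been served by her existing worker, and mallory's attempt has been dropped before any application code saw it.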


6.2 Client side modifications 


To test out the server-side modifications it was necessary 
to produce a netAuth-enabled MVA. Rather than port an 
existing MVA, such as Thunderbird, we instead built a 
netAuth proxy. This has several advantages, including 
portability to systems which do not allow kernel modi- 
fications and ability to support a wide variety of MVAs 
without doing multiple ports. The proxy presented the 
least invasive approach. 

The proxy binds to the IMAP (or POP) port on the
localhost. The events to set up a new connection are:

* the proxy binds to the privileged IMAP port and
blocks waiting for a connection request.


* when a new connection request comes in from the
MVA, the proxy authenticates the MVA. Once authenticated,
the proxy initiates communication with the
dovecot server using the connect_by_user system
call.


* once connected, the proxy just forwards messages
to and from the MVA.


Multiple dovecot servers. It is not unusual for a user
to have multiple mailboxes maintained at more than one
server. In this case, the proxy maintains a system-wide
mapping (common to all users) from non-routable local
IP addresses in the range 127.0.0.0/8 to the well-known
routable IP address of the remote host running the dovecot
server. All the MVAs on the client are then configured
to use IP addresses in this range (published by the proxy)
to refer to their respective hosts.

The proxy binds and listens for connection requests on 
all the published local interfaces (i.e., all the 127.0.0.0/8 
IP addresses configured for the proxy). A request on a 
given IP address corresponds to a particular remote host 
(known to the proxy). The proxy can then follow the 
scheme outlined above to authenticate the user and es- 
tablish the connection. 
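
The system-wide mapping described above amounts to a lookup table keyed by loopback address. A minimal sketch follows, assuming a static table: the specific addresses are made up for illustration (the remote ones come from the reserved documentation ranges), and remote_for is a hypothetical helper, not part of the proxy described in the paper.

```python
import ipaddress

LOOPBACK = ipaddress.ip_network("127.0.0.0/8")

# Hypothetical published mapping: local loopback address -> remote dovecot host.
PUBLISHED = {
    ipaddress.ip_address("127.0.0.2"): ipaddress.ip_address("192.0.2.10"),
    ipaddress.ip_address("127.0.0.3"): ipaddress.ip_address("198.51.100.20"),
}


def remote_for(local: str):
    """Return the remote server for a published loopback address, or None."""
    addr = ipaddress.ip_address(local)
    if addr not in LOOPBACK:
        raise ValueError(f"{local} is not in {LOOPBACK}")
    return PUBLISHED.get(addr)


print(remote_for("127.0.0.2"))
```

Keying on the local address lets the proxy pick the remote host from the address the MVA connected to, without parsing any IMAP traffic.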


7 Security achieved 


The user never has access to his private keys, and in fact
needs permission to authenticate using the private keys.
This mechanism can be expanded to allow different private
keys for different uses, although we do not yet support
that. One use of such a facility is to allow the user to
perform personal chores, such as banking, with one key
and to perform business functions with another key.

Only the specified users can connect to the service,
since they must be authorized. This authorization is
independent of the service; if the service is designated as
an authenticated and authorized service, there is nothing
the service can do (either deliberately or accidentally) to
evade this mechanism. The process may avoid actually
setting the user ID, but the mechanism pairs user authentication
with the connection, so that it can only be used
by the process which accepts that connection and it is
necessary to authenticate before reading or writing to the
connection.

Because the authorization and authentication of user 
services are totally declarative, it is possible to automat- 
ically analyze them. (In contrast, this is not possible in 
general, due to decidability problems, when these func- 
tions are performed by application code.) We are plan- 
ning on extending our previous work in DAC and MAC 
access controls to automatically analyze authorization 
properties across computer systems [36, 35, 37]. 


8 Conclusion 


UBNS requires a mechanism for (1) authentication of 
users over the network and (2) allowing server processes 
to change the user on whose behalf they execute. Imple- 
menting the cryptographic mechanisms for user authen- 
tication as part of the application is complex and error 
prone, and as we showed, requires a substantial amount 
of code. Moving the authentication and cryptographic
mechanisms outside the application makes the application
independent of these mechanisms, which is valuable because
application programmers are usually not skilled in this area.
Moreover, the OS mechanisms for change of process ownership
are also dangerous, as such privileges are among the
strongest in a computer system: changing a user
typically allows the privileges of any user to be appropriated.

And hence, programmers typically defer such consid- 
erations, ignoring them during initial design. But UBNS 
affects the very structure of programs and when its con- 
sideration is delayed, it becomes increasingly expensive 
to retrofit. Thus many applications will not be structured 
as UBNS and the design will not satisfy the property of 
least privilege. 

NetAuth is a simple mechanism to invoke network au- 
thentication and process change-of-ownership, thus en- 
couraging the design of UBNS. It builds on the work of 
Kerberos, SSH, and Plan9 but seeks to do so with the 
style of mandatory access controls and to provide better 
information assurance. It 


* Requires only four lines of code for authenticated
and cryptographically protected communications,
beyond those of a (concurrent) service which neither
authenticates nor encrypts traffic.


* Enables the application developer to focus on the
key task of partitioning the application into processes
early in the design process.


* Removes the need for privileged processes to receive
external input, and thus guards against a range of
attacks including buffer overflow.


* Makes application code independent of the authentication
method, thus enabling changes in the authentication
methods without affecting either source
or binary code.


* Externalizes authorization, making it independent
of application failures.


While the authentication mechanism and APIs described 
here can be used with any authorization model, we have 
also built an authorization model (to be described else- 
where) which has a highly analyzable configuration in 
which strong properties can be understood independently 
of the application code. 

NetAuth integrates public key and a fast re- 
authentication mechanism to achieve high performance 
authentications with the strongest possible properties. 
Further increases in performance are enabled by the re- 
use of processes for the same user, saving system over- 
head. This simplifies the structure of such applications, 
and makes it much easier to build UBNS. Such an easy- 
to-use mechanism will encourage programmers to inte- 
grate security from the start, and thus construct more se- 
cure applications. 

Not only do these mechanisms enable the construction
of more secure services, but they also provide significant
advantages for system administration. These mechanisms
enable strong controls to be imposed on services without
resorting to application-specific configuration and without
analyzing application code.
Notes


1 http://www.securityfocus.com/infocus/1876

2 http://www.dovecot.org/security.html

3 Alternatively, SSH allows a remote executable to be invoked, but
that remote executable is not connected to as a network service.

4 hg-login (http://www.selenic.com/mercurial/wiki/index.cgi/SharedSSH),
as used in Mercurial, performs remote authentication using SSH, but
execs a new program rather than connecting to a running network service.

5 We are using a simplified, and easily customized, certificate rather
than the complex X.509 certificates.

6 The application code forks the new process p_j. This explicit structure
also allows non-privilege-separated iterative and concurrent services,
although these exist largely for legacy applications.

7 Source: http://www.cryptopp.com/benchmarks-amd64.html, for an AMD
Opteron 2.4 GHz processor.

8 The server has an nVidia 570 chipset and the client an nVidia 430
chipset. They both run the open source forcedeth driver.

9 The webpage at http://www.dovecot.org/security.html displays a
list of security holes found in dovecot since the announcement of the
award. The dovecot developer (maintainer of the webpage) claims that
these holes cannot be exploited under reasonable circumstances, stated
as a set of rules on the same page.

10 Dovecot also supports POP, which we ignore for this comparison.

11 Code from the directories: auth, imap-login, login-common, lib-auth
and master (except the configuration code).
Abstract


Hypervisors have been proposed as a security tool 
to defend against malware that subverts the OS kernel. 
However, hypervisors must deal with the semantic gap 
between the low-level information available to them and 
the high-level OS abstractions they need for analysis. 
To bridge this gap, systems have proposed making as- 
sumptions derived from the kernel source code or sym- 
bol information. Unfortunately, this information is non- 
binding — rootkits are not bound to uphold these assump- 
tions and can escape detection by breaking them. 

In this paper, we introduce Patagonix, a hypervisor- 
based system that detects and identifies covertly execut- 
ing binaries without making assumptions about the OS 
kernel. Instead, Patagonix depends only on the proces- 
sor hardware to detect code execution and on the binary 
format specifications of executables to identify code and 
verify code modifications. With this, Patagonix can pro- 
vide trustworthy information about the binaries running 
on a system, as well as detect when a rootkit is hiding or 
tampering with executing code. 

We have implemented a Patagonix prototype on the 
Xen 3.0.3 hypervisor. Because Patagonix makes no as- 
sumptions about the OS kernel, it can identify code from 
application and kernel binaries on both Linux and Win- 
dows XP. Patagonix introduces less than 3% overhead on 
most applications. 


1 Introduction 


Malicious software, otherwise known as malware, con- 
tinues to be a serious problem in today’s computing en- 
vironment. Malware is becoming increasingly difficult to 
detect and remove because it commonly comes bundled 
with a rootkit [12], which abuses administrative privi- 
leges to hide the execution of malware binaries and their 
resource usage from the system administrator. Rootkits 
accomplish this by attacking the administrator’s ability to 
obtain information about a system. For example, rootkits
will subvert execution-reporting utilities, such as ps and
lsmod on Linux systems and the task manager and
Process Explorer [27] on Windows, which administrators
rely on to query the operating system (OS) about
running binaries and kernel modules. Rootkits may also 
subvert the OS kernel itself so that any queries to the ker- 
nel will receive a response that has been appropriately 
distorted by the rootkit. In this way, rootkits have been 
able to elude even the most experienced system admin- 
istrators and sophisticated malware detection tools [11]. 
Even if the rootkit’s presence is discovered, it is difficult 
to determine whether an attempted removal is success- 
ful or not, as the rootkit’s ability to hide executing code 
enables it to trick the administrator into believing that it 
has been removed. As a result, best practice states that 
when a rootkit is even suspected to be present, the ad- 
ministrator must re-install the entire system from scratch 
to be sure that the rootkit is removed — a costly and un- 
desirable solution. Trustworthy execution-reporting util- 
ities, which would enable a system to detect hidden mal- 
ware processes and determine if an attempted removal 
was successful or not, would save administrators a great 
deal of effort and reduce system downtime. 





In this paper, we present Patagonix, a system that de- 
nies rootkits the ability to hide executing binaries from 
the system administrator. Patagonix does this by address- 
ing two shortcomings of current execution-reporting util- 
ities. First, these utilities all depend on the integrity of 
the kernel, both as a source of information and for protec- 
tion against tampering. However, since rootkits can sub- 
vert the kernel, the trust that these utilities and the admin- 
istrator invest in the kernel is misplaced. Second, these 
utilities do not verify the integrity of the binaries they re- 
port as executing. This shortcoming allows a rootkit to 
covertly execute code by injecting malicious code into 
a running binary or by tampering with the binary image 
on disk. Utilities that monitor binaries on disk, such as
Tripwire [17], may detect tampering with on-disk binaries,
but will miss tampering with binaries once they are loaded
in memory.

Unlike existing execution-reporting utilities, Patago- 
nix does not depend on the OS. Instead, Patagonix uses 
a hypervisor, allowing it to retain its integrity even if the 
rootkit has compromised the OS kernel. The challenge to 
implementing an execution-reporting utility in a hypervi- 
sor is the semantic gap [6] between the information avail- 
able to the hypervisor and the actual state of the system. 
Other work has bridged this gap by using and trusting in- 
formation about the OS kernel, such as the kernel source 
code or kernel symbol information [3, 10, 13, 23, 25]. 
However, such information cannot be trusted because it 
is non-binding — the rootkit is not bound to maintain the 
semantics implied by source and symbol information, al- 
lowing it to escape detection. For example, if the hyper- 
visor uses non-binding information about the format or 
location of kernel data structures, the rootkit may evade 
detection by adding fields to the data structures or mov- 
ing the data structures to a memory location that is not 
being monitored. Similarly, assumptions about the code 
structure of the kernel can be exploited by a rootkit that 
modifies OS kernel execution to avoid code paths moni- 
tored by the hypervisor. Patagonix does not rely on any 
non-binding information about the OS kernel and relies 
only on the behavior of the hardware, which cannot be 
altered by malware. 

Patagonix also verifies the integrity of all executing 
binaries before giving their identity to the administrator. 
Rather than verifying the contents of binaries on disk, Pa- 
tagonix inspects the code as it executes in memory. As a 
result, Patagonix cannot be fooled by rootkits that avoid 
tampering with files on disk by injecting malicious code 
into binaries as they run. On the other hand, systems 
make modifications to code at run-time, causing it to dif- 
fer from its image on disk when it is executed. Patagonix 
can differentiate legitimate modifications from malicious 
ones. The executing code is identified using a trusted ex- 
ternal database that contains cryptographic hashes of bi- 
naries, such as the National Software Reference Library 
(NSRL) [20]. 

In this paper we make three main contributions: 


* Patagonix Prototype. We have implemented a Patagonix
prototype that leverages the capabilities of
a hypervisor and the non-executable (NX) bit of the 
Memory Management Unit (MMU) to detect and 
identify all executing binaries regardless of the state 
of the OS kernel. Our prototype, built on the Xen 
3.0.3 hypervisor [4], makes no assumptions about 
the OS kernel. As a result, with the exception of 
the binary format information, which differs from 
OS to OS, it can be used to neutralize rootkits on 
Windows XP, Linux 2.4 and Linux 2.6 OSs without 
modification. 
* Identity Oracles. The semantic gap between the
hypervisor and the OS requires special support to 
differentiate legitimate modifications made to run- 
ning code by the OS from malicious ones made 
by a rootkit. To differentiate legitimate modifica- 
tions from malicious tampering, we introduce the 
concept of an identity oracle, which when given a 
page of code in memory and a database of binaries, 
will either identify the binary from which the code 
page originated, or indicate that the code page is not 
from any of the binaries in the database. We have 
designed an oracle construction framework and im- 
plemented identity oracles for ELF binaries, PE bi- 
naries, the Linux kernel, the Windows XP kernel, 
and Windows driver interrupt handlers. 


* System Usage and Evaluation. We present two
complementary usage modes for Patagonix. In re- 
porting mode, Patagonix serves as a trusted replace- 
ment for the standard execution-reporting utilities 
of an OS, allowing the administrator to see all exe- 
cuting processes even if hidden by a rootkit. This 
augments the administrator’s ability to audit the 
state of the system during regular inspections and 
after an attempted rootkit removal. In lie detection 
mode, Patagonix compares the executing binaries 
reported by the OS with the executing binaries it 
identifies and reports any discrepancies to the ad- 
ministrator [10]. We tested Patagonix on 9 rootkits 
and found that it was able to identify code hidden by 
every one of them. In addition, our Patagonix proto- 
type introduces less than 3% performance overhead 
on most applications. 


We do not claim that Patagonix can detect all rootkits 
since Patagonix focuses on detecting covertly executing 
binaries — a rootkit that does not hide executing binaries, 
but only hides files and network connections, would not 
be detected. Fortunately, techniques to detect such rootk- 
its, which do not depend on non-binding information, al- 
ready exist. For example, using direct access to a raw 
disk image can detect hidden files [13] and a network- 
based intrusion detection system can detect hidden net- 
work connections. However, to the best of our knowl- 
edge, all techniques to detect hidden processes depend 
on non-binding information, making Patagonix useful in 
those circumstances. 

In Section 2, we describe the problem with trusting 
non-binding information, the assumptions that Patago- 
nix relies on, and the guarantees and limitations it has. 
Section 3 gives an overview of the Patagonix architec- 
ture, while Sections 4 and 5 detail our identity oracles 
and our prototype implementation. In Section 6 we de- 
scribe the two usage modes of Patagonix: reporting and 
lie detection. Section 7 evaluates Patagonix’s effectiveness
at detecting covert processes and its performance overhead.
Section 8 discusses related work and we conclude
in Section 9. 


2 Security Model 


2.1 Problem Description 


Systems that monitor OS-level events from a hypervi- 
sor must wrestle with the semantic gap between the state 
of the OS and the information available to the hypervi- 
sor. Previous systems have bridged this gap using non- 
binding information derived from source code and sym- 
bol information, but acknowledge that in doing so they 
make themselves vulnerable to a rootkit that is aware 
of their monitoring technique [3, 10, 13, 23, 25]. For instance,
if the hypervisor monitors the system call table
by using location information derived from non-binding 
sources, the rootkit can evade detection by altering the 
kernel’s system call dispatch handler to use a table placed 
at a different location, and filled with pointers to mali- 
cious system call handlers. The hypervisor-based mon- 
itor would continue to monitor the original, unchanged 
system call table, which is no longer being used by the 
kernel. Unfortunately, preventing this attack by sim- 
ply disallowing modification of kernel code will cause 
false positives because kernels employ self-modifying 
code. Manipulating the dispatch handler is only one 
example; similar assumptions based on non-binding in- 
formation about data types or function entry-points are 
equally prone to subversion. More sophisticated tech- 
niques take a systematic approach to analyzing the Linux 
kernel memory state for tampering by malware, but they 
require ad hoc rules written with expert knowledge [24] 
or source code annotations that provide only partial pro- 
tection [25]. Further, all the aforementioned approaches 
use a sampling approach, creating a window of vulnera- 
bility that may be exploited by malware to remain unde- 
tected. 

Patagonix securely addresses the semantic gap prob- 
lem by avoiding reliance on non-binding information. 
Rather it relies only on information from the proces- 
sor hardware about pages containing executing code. In 
addition, Patagonix detects and validates run-time code
modifications and ensures that they conform to the
modifications permitted in the binary format specification.
Finally, by utilizing the processor MMU hardware,
Patagonix provides continuous monitoring and detection with
very little overhead.


2.2 Assumptions and Guarantees 


To provide security guarantees, Patagonix relies on two 
properties of the hypervisor. First, Patagonix assumes
that the hypervisor will protect both itself and Patagonix
from tampering by a rootkit that has subverted the OS 
kernel. This assumption is consistent with the guaran- 
tees that hypervisors aim to provide. Second, Patagonix 
relies on the hypervisor to provide a secure communi- 
cation channel between it and the user. Patagonix uses 
this channel to inform the user of what binaries it detects 
are running. Because the hypervisor is the only principal 
with direct access to the hardware, this channel can be 
provided in a straightforward way by providing separate 
consoles for the OS and Patagonix. 


Patagonix identifies executing binaries by the crypto- 
graphic hash of the executing code. To convey this infor- 
mation to the administrator in a useful way, these hashes 
must be mapped to the name of a file or application. Ex- 
tracting this mapping from the disk image is not trust- 
worthy since a rootkit can tamper with the disk. Instead, 
Patagonix relies on a trusted database to provide such a 
mapping. This database is assumed to contain the names 
of all legitimate software binaries that the administrator 
has installed on the machine and can also optionally con- 
tain mappings of known malicious binaries. Any exe- 
cuting binary that does not match one in the database is 
identified as “not present” and should be scrutinized by 
the administrator. Publicly available databases currently 
exist — for example, our prototype uses the NSRL [20]. 
We note that the labeling of binaries as legitimate or ma- 
licious is made available purely for the convenience of 
the administrator and is not used by Patagonix. His- 
tory has shown that such labeling may be flawed — there 
have been many documented cases of trojaned, vulner- 
able, or patently malicious binaries being distributed by 
reputable entities [11]. Patagonix correctly handles situ- 
ations where malware is executing on the OS because it 
was incorrectly labeled as legitimate in the database. For 
example, Patagonix can be used to confirm that the in- 
correctly labeled application is no longer executing after 
an attempted removal. 


Even with malware in control of the OS, Patagonix 
guarantees that it is able to identify and report all execut- 
ing binaries. Rootkits may try to hide malware binaries 
from the administrator by either appropriating the name 
of a legitimate application, or by trying to make it invis- 
ible. Patagonix prevents the former by using mappings 
from the trusted database. This also defeats any attempts 
to inject malicious code into legitimate binaries on disk 
or in memory since this will alter the contents of the code 
when it executes. If the rootkit tries to hide the execution 
of a binary by subverting the OS kernel or execution- 
reporting utilities, Patagonix will still identify and report 
the executing binary to the administrator since Patagonix 
monitors the processor hardware for executing code, not 
the OS kernel. With these guarantees, Patagonix can report
the identities of all executing binaries to the user in
reporting mode. Correspondingly, in lie detection mode,
it can notify the administrator of any discrepancies be- 
tween the code it detects and that reported by the OS. 
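
The comparison performed in lie detection mode can be illustrated with a minimal sketch. It assumes that both views are available as sets of binary names; the function name `find_discrepancies` and the sample names are hypothetical and are not taken from the prototype.

```python
def find_discrepancies(reported_by_os, identified_by_monitor):
    """Compare the OS's list of executing binaries with the monitor's list."""
    reported = set(reported_by_os)
    identified = set(identified_by_monitor)
    return {
        # Observed executing by the monitor but missing from the OS report.
        "hidden": identified - reported,
        # Listed by the OS but not observed executing by the monitor.
        "unobserved": reported - identified,
    }


# Hypothetical example: the OS omits one binary that the monitor observed.
result = find_discrepancies({"sshd", "bash"}, {"sshd", "bash", "not present"})
```

Any entry in the "hidden" set is a discrepancy of the kind that would be reported to the administrator.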


2.3 Limitations 


The goal of Patagonix is to provide a trustworthy alter- 
native to traditional OS execution-reporting utilities, thus 
denying rootkits the ability to hide executing binaries 
from the administrator. However, detecting and prevent- 
ing the exploitation of vulnerabilities is outside the scope 
of Patagonix. For example, Patagonix does not detect at- 
tacks that do not inject new code, but instead alter the 
control flow of an application, such as in a return-to-libc
attack [32]. More generally, neither Patagonix nor tradi- 
tional execution-reporting utilities prevent legitimate ap- 
plications from taking malicious actions as a result of 
malicious inputs. For example, the attacker can cause a 
legitimate interpreter or a just-in-time (JIT) compiler to 
perform malicious actions by using it to run a malicious 
script. Despite this, Patagonix provides strong and use- 
ful guarantees. While Patagonix cannot tell if a script is 
malicious or not, it guarantees that the administrator will 
be aware of all executing interpreters and JITs. 

Identifying and verifying the integrity of interpreters is 
the same as for other binaries because all the machine-level
instructions that can be executed by the interpreter are 
known a priori. However, this is not the case for JITs be- 
cause they dynamically generate and execute code whose 
content can be heavily dependent on the workload and 
run-time state. Thus, once Patagonix identifies a pro- 
gram as a JIT, it will ignore pages it observes executing 
in the JIT address space that are not present in the trusted 
database (JITs must always execute code from their bi- 
nary before any dynamically generated code, so Patago- 
nix is always able to identify the process first). While a 
rootkit may exploit this to inject arbitrary code into the 
JIT and escape any sandboxing enforced by the JIT, Pa- 
tagonix’s guarantees still hold because the rootkit will 
not be able to hide the execution of the JIT, nor can the 
rootkit cause Patagonix to misidentify the JIT as another 
application. 

Finally, as mentioned earlier, Patagonix used in lie de- 
tection mode is not a generic rootkit detector: it focuses 
on rootkits that hide executing binaries. 


3 System Architecture 


3.1 Overview 


The architecture of Patagonix is illustrated in Figure 1. 
The majority of Patagonix is implemented in the Pata- 
gonix VM, while a small amount of functionality that 
requires kernel mode privileges is implemented in the
hypervisor. The Monitored VM contains the Monitored
OS for which the administrator wants trustworthy binary 
execution information and the hypervisor protects Pata- 
gonix from tampering by the monitored VM. While im- 
plementing Patagonix entirely within the hypervisor may 
reduce performance overhead, splitting the functionality 
of Patagonix into hypervisor and VM components has 
the benefits of increased modularity, ease of portability 
to a different hypervisor, and a reduction in the size of
the code being added to the security critical hypervisor. 
As we shall see in Section 7, the boundary crossings be- 
tween the hypervisor and VM components of Patagonix 
have a minimal impact on overall performance. 

The Patagonix VM contains three components. First, 
several identity oracles, one for each type of binary in 
the monitored VM, enable Patagonix to identify pages of 
code that are executed in the monitored VM. The iden- 
tity oracles use cryptographic hashes of binaries from the 
trusted database to identify binaries executing in the
monitored VM. Second, a management console implements
the interface between the user and Patagonix. Finally, 
the control logic coordinates events between the manage- 
ment console, the oracles and the hypervisor component 
of Patagonix. 

Only the identity oracles are OS-specific as one must 
be written for every binary format used by the OS in the 
monitored VM. All other components, which we collec- 
tively refer to as the Patagonix Framework, are OS ag- 
nostic. 


3.2 Patagonix Framework 


The Patagonix framework has three main responsibili- 
ties. First, the framework must detect when code is be- 
ing executed in the monitored VM. Second, when code 
execution is detected, it invokes the identity oracles to 
identify the code and maintain a list of executing code. 
The identity oracles will either match the executing code 
to an entry in the trusted database, or will indicate that 
the identity of the code is not present in the database. Fi- 
nally, the framework is responsible for conveying these 
results to the user in a way that is free of tampering by 
malware in the monitored VM. 

Detecting code execution is performed by the Pata- 
gonix hypervisor component using the non-executable 
(NX) page table bit, which is available on all recent 
AMD and Intel x86 processors. When set on a virtual 
page, this bit causes the processor to trap into the hyper- 
visor component whenever code is executed on that page. 
The hypervisor component then informs the control logic 
in the Patagonix VM by sending it a virtual interrupt. 

Figure 1: The Patagonix architecture.

Frequent traps into the hypervisor will hurt performance,
so Patagonix uses the processor to inform it only
when either code is executed for the first time, or code it
has already identified changes and is executed. To identify
code when it executes for the first time, the hypervisor
component initially sets the NX-bit on all pages in the
monitored VM so that it will receive a trap from the pro- 
cessor when a code page is executed. When it receives 
such a trap, the hypervisor component invokes the Pata- 
gonix VM to identify the code and then clears the NX- 
bit on the page, making it executable. At the same time, 
to detect if the identified code is subsequently modified, 
the hypervisor component makes the page read-only by 
clearing the writable bit in the page table. As long as the 
page remains unchanged, subsequent executions of code 
on that page do not cause a trap. If the identified code 
is modified, the processor will trap into the hypervisor, 
at which time the hypervisor component will make the 
page writable but non-executable again. If the modified 
code is executed, the hypervisor component will again 
receive a trap, at which point it will use the Patagonix 
VM to re-identify the code. To eliminate the possibility 
of a race where the rootkit alters the code page after it 
is identified, but before it is made executable, the mon- 
itored VM is paused while the Patagonix VM identifies 
the executing code. Setting executable or writable priv- 
ileges on entire pages at a time is fairly straightforward. 
However, pages that contain mutable data and code re- 
quire the ability to prevent writes to the code portions of 
the page and execution of the data portions of the page.
While this can be implemented with additional hardware, 
we have been able to emulate such support in software. 
We defer the details of the solution to Section 5.2. 
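
The trap-driven protocol described above can be sketched as a small state machine. The following Python model is only an illustration under simplifying assumptions: permissions apply to whole pages, sub-page ranges and the pausing of the monitored VM are not modeled, and the `identify` callback stands in for the Patagonix VM. The class and method names are hypothetical and do not correspond to the prototype's hypercalls.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    executable: bool = False       # NX bit is set initially
    writable: bool = True
    identity: Optional[str] = None


class Monitor:
    def __init__(self, identify):
        self.identify = identify   # stand-in for the identity oracles
        self.pages = {}
        self.traps = 0

    def _page(self, number):
        return self.pages.setdefault(number, Page())

    def execute(self, number, contents):
        page = self._page(number)
        if not page.executable:
            # Execute trap: identify the code, then make the page
            # executable and read-only.
            self.traps += 1
            page.identity = self.identify(contents)
            page.executable, page.writable = True, False
        return page.identity

    def write(self, number):
        page = self._page(number)
        if not page.writable:
            # Write trap: make the page writable but non-executable again,
            # so that the next execution is re-identified.
            self.traps += 1
            page.writable, page.executable = True, False
```

In this model, repeated execution of an unchanged page costs a single trap, while a write followed by an execution costs two further traps and forces re-identification.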


To identify code in memory, the identity oracles require
the contents of the code page being executed, the
virtual address at which the page is located, and the process
the code comes from. The control logic retrieves
this information via new hypercalls, which are hypervi- 
sor analogs of OS system calls we have added to Xen. 
The control logic then passes this information to each of 
the identity oracles, which either return the identity of the 
binary from which the code originated, or indicate that 
the identity of the originating binary is not in the trusted 
database. We note that Patagonix does not use OS pro- 
cess IDs to identify processes as these are controlled by 
the OS and can be subverted by a rootkit. Instead, Pa- 
tagonix identifies a process by its virtual address space, 
which is an equivalent hardware proxy since by defini- 
tion there is a one-to-one relationship between OS pro- 
cesses and address spaces. A process’ address space is 
denoted by the base address of its page table hierarchy, 
which is maintained in a dedicated register on x86 pro- 
cessors. 


Because the hardware only reports when code is exe- 
cuting, rather than when it is not going to be executed any 
more, the control logic records the most recent time it ob- 
served each binary execution and periodically instructs 
the hypervisor to perform a refresh, i.e., set all pages as 
non-executable. Code that is no longer executing will 
not trigger any more traps. Unlike Antfarm [14], Patagonix
does not infer process termination by observing when a
page table no longer contains any valid mappings, because
malware that controls the OS can toggle the page table 
bits between valid and invalid without actually removing 
the process from memory, thus circumventing this pro- 
cess termination heuristic. 
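
The bookkeeping behind the refresh can be sketched as follows. This is a minimal illustration, assuming that each execute trap delivers the page-table base address of the faulting process and the binary name returned by the oracles; the class `ExecutionLog` and its methods are hypothetical names, not part of the prototype.

```python
class ExecutionLog:
    """Tracks when each (address space, binary) pair was last seen executing."""

    def __init__(self):
        self.last_seen = {}

    def observe(self, page_table_base, binary, now):
        # Called on every execute trap. The page-table base address stands
        # in for the process, since OS process IDs are not trusted.
        self.last_seen[(page_table_base, binary)] = now

    def seen_since(self, refresh_time):
        # After a refresh marks every page non-executable, only code that is
        # still running traps again and receives a newer timestamp.
        return sorted(k for k, t in self.last_seen.items() if t >= refresh_time)


log = ExecutionLog()
log.observe(0x1000, "sshd", now=10)
log.observe(0x2000, "bash", now=12)
# A refresh happens at time 20; afterwards only sshd executes again.
log.observe(0x1000, "sshd", now=25)
still_running = log.seen_since(20)
```

Entries whose timestamp predates the last refresh are not deleted in this sketch; they simply show an older last-observed time, which matches the reporting of last-observed execution times.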


The control logic uses the management console to securely
report the list of observed executing binaries and
times they were last observed executing. Because the hy- 
pervisor has control over the hardware, it is able to pro- 
vide the management console in the Patagonix VM with 
an interface separate from that of the monitored VM, thus 
ensuring that the monitored VM cannot tamper with the 
output of the Patagonix VM. 


3.3 Identity Oracles


Executable binaries are mapped from disk into memory 
by a binary loader, whose behavior is governed by the bi- 
nary format that it loads. The task of the identity oracles 
is to use the information provided to them to reverse the 
transformations that the loader applies to binaries, and 
identify which binary in the trusted database (if any) the 
page of code being executed originates from. 

Aside from the information provided to the oracles by 
the hypervisor component, the oracles also require infor- 
mation about the binaries in the database they are try- 
ing to match against. For example, information such as 
hashes of each individual code page in the file and in- 
formation about relocations are required depending on 
the particular format of the binary. While current binary 
databases generally only contain hashes of binary files, 
additional information can be extracted from files on 
disk after they have been authenticated using the trusted 
database. Each oracle initially collects such information 
by searching the disk of the monitored VM for all exe- 
cutable binaries. The authenticity of an executable file is 
verified when its hash is found in the database, and the 
oracle can then proceed to extract additional information 
from the file. Patagonix needs to rescan the disk each 
time binaries are added, or alternatively, a program in the 
OS can be used to gather information about new binaries 
as they are introduced into the system. If an executable 
file is hidden from Patagonix by a rootkit, Patagonix will 
not have the necessary information to identify executing 
code from this binary and thus will not be able to match 
code originating from these binaries against entries in the 
database. As a result, such code will be identified as “not 
present”, thereby indicating to the administrator that a 
rootkit is likely on the system. In either case, access 
to the trusted database itself must be free of tampering 
by the rootkit. We implement our prototype database by 
combining hashes from the NSRL database, hashes from 
signed RPM packages and hashes computed from pris- 
tine binaries directly into the Patagonix VM image. Had 
the database been maintained remotely, it would need to 
be accessed over a secure, authenticated channel such as 
one offered by SSL. 

Once the information about the binaries is acquired, 
the main challenge for the oracles is to reverse the
transformations done by the loader without trusting
information from the OS. Formally, each binary loader can be
modeled as a function $L(B, S) = (\mathcal{M}, \mathcal{A})$, which maps
a particular binary $B$, and the state of the OS at the
binary load-time $S$, to a set of memory pages $\mathcal{M}$ and
a set of addresses $\mathcal{A}$. $\mathcal{M}$ denotes the set of possible
executable pages that the loader may transform the
binary into and $\mathcal{A}$ denotes the possible virtual addresses
at which the loader may place the transformed binary.
The oracle for a particular binary format is a function
$O_L(M, A, P) = \mathcal{B}$, which given a page $M$ detected as
executing by the hypervisor, the virtual address of the
executing code $A$, and the process it was executing in
$P$, produces a set of binaries $\mathcal{B}$, from which the page
could have originated. Since $M$ and $A$ are produced
by the loader, they are elements of sets $\mathcal{M}$ and $\mathcal{A}$
respectively. One cannot implement $O_L$ by only relying
on $S$, since a rootkit can subvert $S$. This inability to
safely infer $S$ represents the semantic gap that the
identity oracles bridge. Since we do not know $S$, $O_L$’s task
can be generalized to searching the set $\mathcal{MA}'$ for the
observed code page and address $(M, A)$, where $\mathcal{MA}'$
contains all code page/address combinations that the loader
could have generated for all binaries and all legitimate
OS states.


$\mathcal{MA}'$ can be very large, making the performance cost
of a naive search impractical. For example, in Windows,
a code page can be mapped at $2^{20}$ possible locations (for
a 32-bit address space when using 4KB pages) and its
contents will be different for each of those possible
locations. If applied to code pages in all binaries in an
average Windows installation, this would result in an $\mathcal{MA}'$
several terabytes in size, which would be overly
expensive to search. To reduce these costs, we exploit two
characteristics that every binary format we have exam- 
ined exhibits. The first is that these formats specify that 
code sections should be mapped to contiguous regions 
of memory. As a result, once the binary that occupies 
a memory region in a process is known, the oracle only 
needs to check that other code executing in the same region
is the appropriate page in the same binary, eliminating
the need to search $\mathcal{MA}'$ in these instances (in
this case, binary can refer to a program binary or a dy- 
namically linked library). Knowing the address where 
a binary is mapped also enables the oracle to reverse 
run-time modifications and derive the original code page, 
eliminating the need to store all versions of the page. To 
establish what binary occupies a region, the oracle ex- 
ploits the second characteristic: binary executables have 
only a few entry-points (usually only one), which are 
executed before any other code in the binary. As a re- 
sult, if code executes in a memory region where the or- 
acle has not identified a binary before, the oracle only 
has to check for code at pages containing entry-points 
in $\mathcal{MA}'$. This reduces the search space, and also adds
a desirable security check since the oracle will identify
code as “not present” if the malware tries to jump into 
a binary at any point other than a legitimate entry-point. 
We use these assumptions about binaries as hints to im- 
prove the performance of Patagonix. However, Patago- 
nix does not trust these hints, so its security guarantees 
are not affected — tampering with the binaries that vio- 
lates these assumptions will result in the tampered binary 
being identified as “not present”. 
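
The size of the naive search space can be checked with a short calculation. The sketch below rests on two assumptions that are not stated in the text: that one 32-byte SHA-256 digest would be stored per page variant, and that an installation contains 100,000 code pages. The latter figure is purely hypothetical and is chosen only to show the order of magnitude.

```python
ADDRESS_BITS = 32        # 32-bit virtual address space
PAGE_SIZE = 4096         # 4KB pages
HASH_BYTES = 32          # assumed: one SHA-256 digest per page variant
CODE_PAGES = 100_000     # hypothetical number of code pages, illustration only

# Number of page-aligned locations at which a code page could be mapped.
locations = 2**ADDRESS_BITS // PAGE_SIZE

# Storage needed to keep one digest per location for a single code page.
per_page_bytes = locations * HASH_BYTES

# Storage for all code pages under the assumed page count.
total_bytes = per_page_bytes * CODE_PAGES
```

Under these assumed figures, a single code page needs 32 MiB of digests and the whole table comes to roughly 3.4 TB, which is consistent with the estimate of several terabytes.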


Figure 2 illustrates our oracle construction framework. 
Four components in the framework are binary loader spe- 
cific. The first is an entry-point database, which contains 
information on the entry-points of known binaries. This 
database is searched using an entry-point search func- 
tion. The other two components are the code database, 
which contains information on the rest of the code sorted 
by binary, and the code check function which checks 
code against the code database. An oracle invocation 
begins with the control logic forwarding the page con- 
tents, faulting virtual address and process to the oracle. 
The oracle first checks whether the virtual address and 
process of the code are from a region where the binary 
is known. If not, then the binary has just started exe- 
cuting because no code has been observed executing at 
this location before. The oracle searches the entry-point 
database for a match to identify the binary. If a match is 
found, it records the binary’s name and memory range it 
should occupy and returns the name of the binary. Oth- 
erwise, the oracle identifies the code as “not present” in 
the database. 


If the address is from a memory region whose binary 
has been previously identified, then the oracle checks 
that the executing page is from the associated binary. If 
it is, the oracle returns the name of the binary. If it is not, 
then the binary no longer occupies that memory range 
in that process. The memory region record is removed 
and the oracle searches for the page in the entry-point 
database. 
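
The invocation flow described above can be sketched in a few lines. This is a simplified model, assuming unmodified pages (as in the ELF case), one entry-point page per binary, and no identical pages shared between binaries; the class `Oracle`, its fields and the data layout are hypothetical.

```python
import hashlib

PAGE = 4096
NOT_PRESENT = "not present"


def sha256(data):
    return hashlib.sha256(data).hexdigest()


class Oracle:
    def __init__(self, code_db, entry_pages):
        # code_db: binary name -> list of page hashes, in page order.
        # entry_pages: binary name -> index of the page holding its entry-point.
        self.code_db = code_db
        self.entry_db = {code_db[b][i]: (b, i) for b, i in entry_pages.items()}
        self.regions = {}  # process -> list of (start, end, binary)

    def _search_entry_points(self, process, addr, digest):
        hit = self.entry_db.get(digest)
        if hit is None:
            return NOT_PRESENT
        binary, index = hit
        # Record the memory range the binary should occupy.
        start = addr - index * PAGE
        end = start + len(self.code_db[binary]) * PAGE
        self.regions.setdefault(process, []).append((start, end, binary))
        return binary

    def identify(self, process, fault_addr, page):
        addr = fault_addr - fault_addr % PAGE
        digest = sha256(page)
        for region in self.regions.get(process, []):
            start, end, binary = region
            if start <= addr < end:
                # Code check: the page must match the expected page of the binary.
                if self.code_db[binary][(addr - start) // PAGE] == digest:
                    return binary
                # The binary no longer occupies this range: drop the record.
                self.regions[process].remove(region)
                break
        return self._search_entry_points(process, addr, digest)
```

In this sketch, a page that is not an entry-point page and executes outside any known region is reported as "not present", which mirrors the check against jumping into a binary at a point other than a legitimate entry-point.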


We have observed cases of related binaries containing 
identical code pages. If there have not been enough pages 
executed to uniquely identify the binary, the identity ora- 
cles return a list of candidate binaries until a unique page 
of code is executed. Should a page contain a mix of data 
and code, the oracles also return the sub-page range of 
the code. 


4 Oracle Implementation 


In this section, we describe the oracles we have con- 
structed for various binary formats and their loaders. We 
find that while binary formats may differ, the operations 
performed by the loaders of these formats have similari- 
ties, allowing common techniques to be used across the 
oracles for different formats. We classify our oracles into 
two categories based on the type of binaries they iden- 
tify. The first category consists of oracles for application 
code in Linux and Windows. We discuss support for the 
two main methods for dynamic code loading: position 
independent code and run-time code relocation, both of 
which are represented in the ELF and PE formats used 
by Linux and Windows respectively. The other category 
consists of kernel code in Linux and Windows. This code 
poses some extra challenges because both kernels contain
self-modifying code. However, our oracles are able
to verify that such modifications are applied correctly. Finally,
we finish this section with a discussion on the generality of our
identity oracles. 


4.1 Application Binary Oracles 


ELF Oracle. The Executable and Linkable Format 
(ELF) [33] is used by Linux, as well as other OSs such as 
Solaris, IRIX and OpenBSD. An ELF file is divided into 
segments and contains a program header table that speci- 
fies the address at which each segment should be mapped 
into memory. ELF segments in the binary are identical to 
the segments that will be loaded in memory and no run- 
time modifications are required from the loader. Code 
in executable segments can either be relocatable, mean- 
ing it can be loaded at any address in memory, or non- 
relocatable, meaning that it must be loaded at a particular 
address. All references to absolute addresses in relocat- 
able code go through indirection tables, which are filled 
in by the run-time linker. ELF shared libraries are typi- 
cally relocatable, while executable binaries are typically 
non-relocatable. 

Since ELF shared libraries use position independent 
code, both ELF libraries and ELF applications are map- 
ped from disk into memory without any modifications, 
making this our simplest oracle. To populate the entry-point
database for the ELF oracle, pages containing
entry-points are placed in the database: all shared objects
have an _init subroutine that is run when the
shared object is loaded and executables always begin
execution in _start. To save space, the ELF oracle
does not store the entire page contents in the database,
but instead stores a cryptographic hash (SHA-256) of the
page. The hashes are stored in a sorted list and
the entry-point search function computes the hash of the 
page where code execution was detected and searches the 
entry-point database for a match. 

The code database stores hashes of all pages for each 
binary in a two dimensional array that is indexed first 
by binary and second by page offset from the beginning 
of the binary. The check function uses the binary name 
attached to the memory region to compute the first index 
in a look up and the offset of the executing page from 
the start of the memory region to compute the second 
index. A hash of the executing page is then compared to 
the hash from the code database. Because SHA-256 is
collision-resistant and difficult to invert, any tampering
with the binary will result in the binary being identified as
“not present”.
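
The two ELF databases can be illustrated with a minimal sketch. It assumes that the sorted list holds (hash, binary name) pairs searched by bisection, and it uses a dictionary of per-binary hash lists as a stand-in for the two dimensional array; the function names and sample data are hypothetical.

```python
import bisect
import hashlib


def page_hash(page):
    return hashlib.sha256(page).digest()


def build_entry_db(entry_pages):
    # entry_pages: list of (binary name, bytes of the entry-point page).
    return sorted((page_hash(page), name) for name, page in entry_pages)


def search_entry_db(db, page):
    # Binary search for the first entry with this hash, then collect every
    # binary that shares it (related binaries may have identical pages).
    digest = page_hash(page)
    i = bisect.bisect_left(db, (digest, ""))
    names = []
    while i < len(db) and db[i][0] == digest:
        names.append(db[i][1])
        i += 1
    return names


def check_page(code_db, binary, region_start, addr, page, page_size=4096):
    # First index: the binary; second index: page offset within the region.
    index = (addr - region_start) // page_size
    hashes = code_db[binary]
    return 0 <= index < len(hashes) and hashes[index] == page_hash(page)
```

Returning a list from the search keeps the case of several candidate binaries visible to the caller instead of picking one arbitrarily.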

PE Oracle. The Portable Executable (PE) format [19] 
is used in all versions of Windows after Windows NT 
3.1. Similar to ELF files, PE files have a header table 
that describes how sections in the file should be mapped 
in memory. However, code in PE files contains absolute 
addresses, and thus is not position independent. All PE 
files have an image base, which indicates the preferred 
address for loading the file. If an application needs to 
load two or more Dynamically Linked Libraries (DLLs)
that occupy overlapping preferred address regions, the 
OS must relocate one or more of the binaries. To do 
this, the absolute addresses in the executable must be ad- 
justed by adding the offset between the preferred address 
and the actual address where the binary is loaded. This 
relocation operation is performed by the OS using the 
information stored in the binary header. 

PE binaries pose two challenges. First, because the OS 
may adjust the absolute addresses in a binary, one cannot 
directly use page contents to identify code pages in the 
entry-point database. Instead, the PE oracle exploits the 
fact that the PE loader only relocates binaries by 4KB 
page offsets, meaning that the offset of the entry-point 
from the top of the page (i.e. the page-offset) is always 
the same. Thus, the entry-point database is indexed by 
the page-offset of the entry point and contains the loca- 
tions of the absolute addresses in each candidate page, as 
well as a hash of its contents. The search function then 
searches the entry-point database for the page-offset of 
the faulting address to determine the binary. 

In some cases, several binaries may have the same 
entry-point offset, so the search function must find the 
matching page within a set of more than one candidate
page. For each candidate, the search function undoes
the absolute address adjustments made by the OS during 
relocation. This is accomplished by making a copy of 
the executed page and subtracting the relocation offset 
from each absolute address. This offset is the difference 
between the entry-point address of the executed page and 
the entry-point address of the candidate if it were mapped 
at its preferred address. A hash of the copy can then be 
compared against the hash of the candidate. 
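
The relocation reversal can be sketched as follows. The example assumes 32-bit little-endian absolute addresses that do not straddle a page boundary, SHA-256 as the page hash, and a candidate record with the hypothetical fields `preferred_entry`, `reloc_offsets` and `hash`; none of these names come from the prototype.

```python
import hashlib
import struct


def undo_relocation(page, reloc_offsets, delta):
    """Return a copy of the page with delta subtracted from each 32-bit
    absolute address stored at the given offsets."""
    copy = bytearray(page)
    for offset in reloc_offsets:
        (value,) = struct.unpack_from("<I", copy, offset)
        struct.pack_into("<I", copy, offset, (value - delta) & 0xFFFFFFFF)
    return bytes(copy)


def matches_candidate(page, entry_addr, candidate):
    # Relocation offset: observed entry-point address minus the address the
    # entry-point would have if the binary were mapped at its preferred base.
    delta = entry_addr - candidate["preferred_entry"]
    restored = undo_relocation(page, candidate["reloc_offsets"], delta)
    return hashlib.sha256(restored).digest() == candidate["hash"]
```

Because the hash is computed over the restored copy, a page that differs from the on-disk version in anything other than the listed absolute addresses fails the comparison.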

The second challenge is that some PE binaries have 
memory pages that contain both code and mutable data. 
For example, the Import Address Table (IAT), which is 
used to dynamically link DLLs against an application, is 
typically put in the code section by the Microsoft com- 
piler. As a result, the search function only uses the por- 
tions of these pages that contain code to identify them, 
and will notify the control logic, which in turn will in- 
struct the hypervisor to make only the identified por- 
tions of the pages executable. Naturally, the entry-point 
database entries for these pages must also contain infor- 
mation listing what portions of the page contain code. 
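
Identification of a mixed page from its code portions alone can be sketched as follows. The sketch assumes that the database entry lists the code portions as half-open byte ranges within the page and that SHA-256 is the hash; the function name `hash_code_ranges` and the sample layout are hypothetical.

```python
import hashlib


def hash_code_ranges(page, code_ranges):
    """Hash only the byte ranges of the page that contain code, so that
    mutable data on the same page does not affect the result."""
    digest = hashlib.sha256()
    for start, end in code_ranges:
        digest.update(page[start:end])
    return digest.hexdigest()


# Hypothetical layout: bytes 100..199 hold mutable data, the rest is code.
ranges = [(0, 100), (200, 4096)]
page_a = b"\x90" * 100 + b"\x01" * 100 + b"\xc3" * 3896
page_b = b"\x90" * 100 + b"\x02" * 100 + b"\xc3" * 3896
same_identity = hash_code_ranges(page_a, ranges) == hash_code_ranges(page_b, ranges)
```

The same ranges would then be used to mark only the identified portions of the page as executable.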

The rest of the PE oracle is straightforward. The 
code database and check function are also similar to the 
ELF oracle except that they must undo any relocations 
before comparing the page contents and they must ac- 
count for pages that only partially contain executable 
code. Thus, the code database also stores the preferred 
address with each binary, and the locations of all abso- 
lute addresses and sub-page code ranges (if necessary) 
with each page entry. To undo the relocations, the check 
function uses the actual address the binary was mapped 
in at, which is given by the start address of the memory 
region record, and then uses the same technique as the 
entry-point search function. In this way, the PE oracle 
provides the same guarantees as the ELF oracle. 


4.2 Kernel Binary Oracles 


Linux Kernel Oracle. The Linux kernel’s code pages in 
memory are not always identical to their on-disk repre- 
sentation. Recent versions of the Linux kernel customize 
their binaries at run-time depending on the availability of 
more efficient instructions for the CPU the kernel is exe- 
cuting on. For example, the kernel will implement mem- 
ory barriers with LFENCE and MFENCE instructions if 
running on newer x86 processors with SSE2 extensions. 
Altering these instructions at run-time allows a single 
kernel binary to be used on different CPUs. In addition, 
the Linux kernel can dynamically load and unload kernel 
modules at run-time. 

The aspects of the Linux kernel that differentiate it 
from application code are self-modifying code and the 
ability to dynamically load modules. However, both of 
these can be handled with the techniques used in the PE 
oracle. In the Linux kernel, the locations of customizable 
instructions, the instructions they can be replaced 
with, and the conditions to permit replacement are stored 
in special sections of the kernel binary. Using this infor- 
mation, the search and check functions make a copy of 
the page, verify that the substitutions are legitimate, and 
then undo them by replacing them with the default on- 
disk instructions. The pages are then hashed and com- 
pared against the entries in the databases. 
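
A minimal sketch of this verify-then-undo step is shown below. The record layout (offset, default bytes, list of permitted replacements) is a hypothetical simplification of the information stored in the kernel binary, and the check of the conditions that permit each replacement is omitted.

```python
import hashlib


def undo_customizations(page: bytes, alternatives) -> bytes:
    """Verify and revert run-time instruction substitutions on a copy of
    `page`. Each entry of `alternatives` is a hypothetical record
    (offset, default_bytes, allowed_replacements). Raises ValueError if
    the bytes found are neither the default nor a permitted replacement."""
    copy = bytearray(page)
    for offset, default, allowed in alternatives:
        current = bytes(copy[offset:offset + len(default)])
        if current != default and current not in allowed:
            raise ValueError(f"illegitimate substitution at offset {offset}")
        copy[offset:offset + len(default)] = default
    return bytes(copy)


def page_matches(page: bytes, alternatives, expected_hash: bytes) -> bool:
    """Hash the restored page and compare it against a database entry
    (SHA-256 is an assumption)."""
    restored = undo_customizations(page, alternatives)
    return hashlib.sha256(restored).digest() == expected_hash
```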

Linux kernel modules can be loaded at any location 
in memory and have both relocations and customizations 
that are adjusted at load-time. They also contain an ini- 
tialization function that can serve as an entry-point for 
the module, making their loader very similar to that of 
a PE DLL. As a result, much like in the PE oracle, the 
Linux kernel oracle uses an entry-point database consist- 
ing of entry-point offsets. Once a kernel module is iden- 
tified, the memory range it occupies is recorded. 

Windows Kernel Oracle. The Windows kernel exhibits 
behavior similar to the Linux kernel, where some 
of its code pages are customized at run-time by patch- 
ing the kernel code. In addition, Windows also permits 
run-time loading of kernel modules and drivers. 

Unlike the Linux kernel, the Windows kernel’s re- 
placements are not specified in the kernel binary, but 
are applied in an ad hoc fashion by various functions 
throughout the kernel. However, since these customiza- 
tions are deterministic for a given hardware platform and 
occur early during boot, it is possible to record the cus- 
tomizations from a pristine kernel and use these to verify 
the customizations in the monitored VM. While this ap- 
proach cannot guarantee completeness (for example, we 
do not know what replacements will take place for other 
hardware), we believe that a developer with more infor- 
mation about the Windows kernel customizations would 
be able to exhaustively enumerate the transformations 
the kernel performs at run-time. The Windows kernel 
oracle handles the run-time loading of drivers in exactly 
the same way as the Linux kernel oracle. 

Both the Linux kernel oracle and the Windows kernel 
oracle provide the same guarantees as the ELF and PE 
oracles. While the PE oracle validates relocations by us- 
ing the difference between the actual address and the pre- 
ferred address, the kernel oracles perform an equivalent 
validation for run-time customizations by ensuring that 
modified instructions are replaced with legitimate substi- 
tutes. 

Windows Interrupt Handler Oracle. To allow 
drivers to register interrupt service routines, the Windows 
kernel provides an interrupt object abstraction. To al- 
low for driver portability, when such an interrupt object 
is initialized by the driver, 106 bytes of kernel-specific 
code is copied from an interrupt handling template into 
the object, and will be executed whenever an interrupt 
associated with the object occurs [28]. 

While this appears to be a form of dynamic code gen- 
eration, it is actually very easy to write an oracle that 
identifies the Windows Interrupt Handler. The code is 
shorter than a page, so it can be efficiently identified and 
validated in its entirety with one oracle invocation. As a 
result, the Interrupt Handler oracle does not need a code 
database or check function. Furthermore, the code is ex- 
actly the same every time it is copied except for an 8 
byte field that contains run-time parameters and absolute 
addresses, which is customized for each driver. As a re- 
sult, no entry-point database exists for this oracle, and the 
search function simply performs a byte-by-byte compar- 
ison of the code starting at the faulting address with the 
106 byte template. If there is a match, the code is iden- 
tified as a Windows Interrupt Handler and only the 106 
byte region is made executable and non-writable. 
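
The template comparison can be sketched as a masked byte-by-byte match. The positions of the customized bytes are not given here, so `variable_offsets` is a hypothetical parameter supplied by whoever extracts the template.

```python
def matches_template(memory: bytes, template: bytes, variable_offsets) -> bool:
    """Compare `memory` (the bytes starting at the faulting address)
    against `template`, skipping the per-driver customized bytes."""
    if len(memory) < len(template):
        return False
    skip = set(variable_offsets)
    return all(
        memory[i] == template[i]
        for i in range(len(template))
        if i not in skip
    )
```

As in the prototype described above, the skipped bytes are not validated by this sketch.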

Our prototype oracle currently does not perform fur- 
ther checks on the 8 bytes that are modified dynamically 
by the kernel. This means that an attacker can arbitrar- 
ily modify these bytes. However, this is a small amount 
of memory, and these bytes are not contiguous. A more 
sophisticated oracle could also validate the contents of 
these bytes. 


4.3 Discussion 


To better understand the generality of the approaches we 
have employed for our prototype oracles, we examined 
descriptions of other common binary formats and load- 
ers. We found that for application code, the main reason 
for run-time code modifications is to support the need 
to be able to dynamically load libraries at any base ad- 
dress. Nearly every binary format we examined, which 
included common formats such as the Mac OS X Mach-O 
format, the COFF format used by SysV, and a.out, uses 
either position independent code or rebasing — both of 
which we are able to handle. 

Another interesting class of loaders are executable 
packers. They incorporate code into a compressed bi- 
nary to decompress the code just before execution. As a 
result, the compressed binary needs to be unpacked first 
before the oracle gathers information from it. This ex- 
tra step is conducted when Patagonix adds a packed bi- 
nary to the code database. Our prototype currently only 
handles binaries that have been packed using the popular 
UPX [21]. To support additional packers, Patagonix only 
needs to be provided with an unpacker. For example, Pa- 
tagonix could use PolyUnpack [26] to automatically sup- 
port a large number of executable packers. 

Finally, we observed two non-JIT binaries that dynamically 
generate code: winlogon.exe, which authenticates 
users, and the Windows Genuine Advantage application, 
which checks the Windows OS for evidence of 
piracy. No formal specification exists for the code gen- 
erated by these applications and there is evidence that 
the code is generated to obfuscate self-integrity-checking 
operations. Without more information (like we had for 
the Windows interrupt handlers) or reverse engineering 
(which would violate the EULA), we cannot build an or- 
acle that validates the legitimacy of the generated code. 
Thus, these binaries are treated as JITs — we can identify 
that they are executing, but do not examine other code 
pages in their address space. 


5 Framework Implementation 


We used the Xen 3.0.3 hypervisor as a basis for building 
our Patagonix prototype. When used in Hardware Vir- 
tual Machine (HVM) mode, Xen utilizes virtualization 
support in x86 processors to run unmodified operating 
systems, including both Linux and Windows. With the 
exception of our emulated sub-page privileges support, 
our implementation of Patagonix can run on both AMD 
and Intel processors. In implementing Patagonix, we 
found that while the MMU provides a way to efficiently 
detect code execution, care needs to be taken to ensure 
that all code execution in the monitored VM is detected. 
Another shortcoming of the processor support was the 
inability to allow or deny execution of, or writes to, pages 
at a sub-page granularity. Finally, we discuss a performance 
optimization that reduces the number of Patagonix VM 
invocations the hypervisor must make. 


5.1 Detecting Code Execution 


The non-executable permission bit was primarily imple- 
mented to allow an OS to prevent unauthorized code ex- 
ecution. When this mechanism is virtualized, there are 
two issues that must be taken into account to ensure that 
all instances of new code execution are detected by the 
hypervisor. 

The first issue arises from the fact that page permission 
bits apply to a virtual page mapping and not to a physical 
page. Since there can be more than one virtual mapping 
for a physical page, our hypervisor modifications must 
ensure that there cannot be writable and executable map- 
pings of a physical page simultaneously. Otherwise, the 
rootkit could use one mapping to modify the page and 
the other to execute it. We accomplish this by leverag- 
ing Xen’s frame map, which maintains a count of the 
number of mappings of each physical page. Whenever a 
page changes from writable to executable or vice versa, 
Xen consults the count in the frame map to see if any 
other virtual mappings need to be updated appropriately. 
Xen’s frame map only maintains a count of the number 
of mappings, and is not a reverse frame-map; as a result, 
we must walk the page tables to find and change all other 
mappings. 
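
The invariant being enforced, that no physical page is ever writable through one mapping and executable through another, can be modeled with the toy sketch below. It is not Xen code: the class, its methods, and the choice to let a new alias adopt the frame's current state are assumptions made for illustration.

```python
class FrameTracker:
    """Toy model of the rule that all virtual mappings of a physical
    frame share one state: writable ("w") or executable ("x")."""

    def __init__(self):
        self.perms = {}  # frame -> {vaddr: "w" or "x"}

    def add_mapping(self, frame, vaddr):
        aliases = self.perms.setdefault(frame, {})
        # A new alias adopts the frame's current state (writable by default).
        aliases[vaddr] = next(iter(aliases.values()), "w")

    def set_frame_state(self, frame, state):
        """Flip every alias of `frame` together, standing in for the page
        table walk that finds and changes all other mappings."""
        assert state in ("w", "x")
        for vaddr in self.perms[frame]:
            self.perms[frame][vaddr] = state

    def consistent(self):
        return all(len(set(a.values())) <= 1 for a in self.perms.values())
```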

This issue could also be fixed by upcoming nested- 
page table (NPT) support, which provides full hard- 
ware virtualization support for page tables. NPTs add a 
shadow page table, which allows the hypervisor to spec- 
ify a second translation between the guest physical frame 
numbers and the actual machine frame numbers. With 
this, the hypervisor could simply control the permissions 
for the machine frames, removing the need to track the 
number of guest virtual mappings for each physical page. 
To be notified when new code is executed, Patagonix 
marks pages as non-executable in the shadow page ta- 
ble, and then makes them executable after they have been 
identified. We do note that in doing this, Patagonix will 
negate one of the possible advantages of NPTs, which is 
to allow superpage mapping of a contiguous set of guest 
physical frames with a single NPT entry. 

The second issue stems from the fact that the virtual 
Direct Memory Access (DMA) unit in Xen runs in a separate 
protection domain (the privileged domain0) and 
thus is not constrained by the page access restrictions 
placed on the rest of the monitored VM. Malware that is 
aware of this could abuse the virtualized DMA to mod- 
ify memory pages that have been marked as executable 
and read-only. To make sure that memory content was 
always checked before being executed, we modified the 
emulated DMA devices to inform the hypervisor when 
they write to any pages. If any of these pages are marked 
as executable, Xen makes these pages non-executable 
again. 


5.2 Sub-page support 


Sub-page permissions are necessary when a memory 
page contains a mix of identified code and mutable 
data: the code must be made non-writable, and the data 
must be made non-executable. Ideally, sub-page support 
would be provided in hardware using a scheme such as 
Mondrian memory [35] or Transmeta’s Crusoe proces- 
sor [8]. However, because such support is not available 
on x86 processors, we devised a method to emulate this 
support based loosely on a technique that Van Oorschot 
et al. used to circumvent code tampering detection [34]. 
The technique takes advantage of the separate Transla- 
tion Lookaside Buffers (TLB) for instructions (ITLB) 
and data (DTLB) present in x86 processors. 

Our solution maps an execute-safe version of the page 
to a virtual address for instructions, and the original to 
the same virtual address for data. The execute-safe ver- 
sion is a copy of the mixed page where the data sections 
have been made non-executable by replacing them with 
trap instructions. A mapping to this version is loaded 
into the ITLB by temporarily setting the shadow page 
table entry to be executable, pointing it to the execute-safe 
version and executing a single instruction from that 
page. After that, the shadow page table entry is switched 
back to the original page and made writable and non- 
executable. This emulates the sub-page permission con- 
trol we require since any attempt to execute at an address 
from the data regions will go through the ITLB and re- 
sult in a trap, and any modifications to the code region 
will go through the DTLB and will not be applied to the 
page that instructions are being fetched from. To ensure 
that the execute-safe page is not accidentally loaded into 
the DTLB by an unintended load or store while setting 
up the TLBs, Patagonix disables interrupts for the moni- 
tored VM during this operation. 
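
Constructing the execute-safe copy of a mixed page might look like the following sketch. The use of the one-byte x86 INT3 opcode (0xCC) as the trap instruction is an assumption, as are the function name and the (start, end) representation of code ranges; the TLB manipulation itself is not modeled.

```python
PAGE_SIZE = 4096
TRAP = 0xCC  # x86 INT3; which trap instruction is used is an assumption


def build_execute_safe_page(page: bytes, code_ranges) -> bytes:
    """Return a copy of `page` in which every byte outside `code_ranges`
    (list of (start, end) offsets, end exclusive) is a trap instruction."""
    safe = bytearray([TRAP]) * len(page)
    for start, end in code_ranges:
        safe[start:end] = page[start:end]
    return bytes(safe)
```

Instruction fetches see this copy through the ITLB, so execution that strays into a data region traps, while loads and stores continue to see the original page through the DTLB.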

The emulation has some drawbacks compared with native hardware 
support. First, the emulation does not trap into 
the hypervisor when a write is attempted to a code re- 
gion. Such functionality would be needed to deal with 
run-time modifications to a mixed page, but we have not 
found this necessary in practice. Second, because the TLB 
manipulation must be repeated every time the ITLB mapping 
for such a page is loaded, ITLB misses for such pages are 
transformed into page faults that require 
two traps into the hypervisor. Finally, this functionality 
cannot be emulated on Intel processors because, at the 
time of writing, Intel processors flush both TLBs on ev- 
ery crossing between the hypervisor and the VM. 


5.3 Performance Optimizations 


The dominant source of overhead in Patagonix is the 
page faults that occur when the monitored VM executes 
pages marked non-executable by Patagonix and the sub- 
sequent Patagonix VM invocation to identify the newly 
executing code. Some of these page faults are unnec- 
essary because the executing code is on a physical page 
that has already been identified when it was executed in 
another process. Thus, we added an optimization that 
avoids the extra page fault and Patagonix VM invocation 
for pages whose identities are already known. This is ac- 
complished by maintaining a list of physical pages that 
have been identified and whose virtual mappings are all 
executable and non-writable. When the monitored VM 
attempts to map such a page as executable in a new pro- 
cess, Patagonix preemptively makes the new mapping 
executable and non-writable. 

The hypervisor must log each time this optimization is 
applied, for two reasons. First, this information is 
required to maintain the consistency of the 
memory region information for the oracles. Second, 
this information is required by the Patagonix 
VM to maintain an accurate record of when pages 
from each binary were observed executing. To avoid extra 
domain crossings but keep the Patagonix VM’s view 
of the monitored VM current, this log is read by the Patagonix 
VM whenever it is invoked by the hypervisor to 
identify a page, whenever it requests the hypervisor to 
perform a refresh and whenever the user requests a list of 
executed binaries through the management console. As 
a result, this optimization has no effect on how current 
the Patagonix VM’s information on executing binaries 
is, and thus has no impact on the security guarantees of 
Patagonix. 
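
The bookkeeping for this optimization can be sketched as below. The class and method names are hypothetical, and the sketch models only the list of identified pages and the log, not the page table updates.

```python
class IdentifiedPageCache:
    """Toy model of the hypervisor-side list of identified pages."""

    def __init__(self):
        self.identified = {}  # physical frame -> binary name
        self.log = []         # (frame, binary) entries awaiting the Patagonix VM

    def record(self, frame, binary):
        """Called once the Patagonix VM has identified `frame`."""
        self.identified[frame] = binary

    def invalidate(self, frame):
        """Called when any mapping of `frame` becomes writable again."""
        self.identified.pop(frame, None)

    def on_map_executable(self, frame):
        """Return True if the new mapping can be made executable and
        non-writable immediately, without invoking the Patagonix VM."""
        binary = self.identified.get(frame)
        if binary is None:
            return False
        self.log.append((frame, binary))
        return True

    def drain_log(self):
        """Hand the accumulated entries to the Patagonix VM."""
        entries, self.log = self.log, []
        return entries
```

Draining the log on every Patagonix VM invocation, refresh, and user query corresponds to the three read points listed above.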


6 Usage 


Patagonix has two usage modes. In reporting mode, Pa- 
tagonix provides trustworthy execution-reporting infor- 
mation and is functionally similar to utilities such as ps, 
lsmod and the task manager. This gives the sys- 
tem administrator a trustworthy alternative information 
source when evaluating if their system has processes hid- 
den by a rootkit, or whether an attempted rootkit removal 
has been successful. In lie detection mode, Patagonix 
compares the list of executing binaries reported by the 
monitored OS with what it detects is executing. Differ- 
ences mean that the OS is lying and indicate that a rootkit 
is present on the system. 

When in reporting mode, Patagonix displays a list of 
all executing binaries on the management console. This 
is semantically similar to the list displayed by utilities 
such as top or the task manager. Patagonix also 
displays the times they were last observed executing. The 
administrator can also use Patagonix to terminate or sus- 
pend the execution of all instances of a binary by issuing 
commands to the management console, creating a trustworthy 
version of the UNIX kill utility. To terminate 
a binary, Patagonix sets all pages of that binary to non- 
executable. When an execution fault occurs on one of 
the code pages, Patagonix replaces the instruction at the 
faulting address with an illegal instruction. This makes 
it appear to the monitored OS that the binary tried to ex- 
ecute an illegal instruction, causing the monitored OS to 
terminate it. Suspending execution is achieved by replac- 
ing the code with an empty loop instead of replacing it 
with an illegal instruction. Thus, the binary is still ex- 
ecuting from the OS’ point of view, yet no code from 
the actual binary is being executed. A more efficient, but 
OS-specific implementation could inject code that causes 
the application to sleep. 
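
The two patching actions can be illustrated with the sketch below. The specific encodings are assumptions: UD2 (0F 0B) stands in for "an illegal instruction" and a two-byte jump-to-self (EB FE) stands in for "an empty loop". The function name is hypothetical, and patches that would cross a page boundary are simply rejected.

```python
UD2 = bytes([0x0F, 0x0B])   # x86 undefined-instruction opcode (assumed choice)
SPIN = bytes([0xEB, 0xFE])  # x86 short jump to itself (assumed choice)


def patch_on_fault(page: bytes, fault_offset: int, action: str) -> bytes:
    """Return a copy of `page` with the instruction at `fault_offset`
    replaced according to `action` ("terminate" or "suspend")."""
    patch = {"terminate": UD2, "suspend": SPIN}[action]
    if fault_offset < 0 or fault_offset + len(patch) > len(page):
        raise ValueError("patch does not fit inside the page")
    copy = bytearray(page)
    copy[fault_offset:fault_offset + len(patch)] = patch
    return bytes(copy)
```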

In lie detection mode, Patagonix compares execution 
information reported by the monitored OS with its own 
list of executing binaries. Patagonix obtains execution 
information from the monitored OS via an agent in the 
monitored VM. The agent is a program that queries the 
monitored OS via standard interfaces to obtain a list of 
executing processes. Previous systems that performed 
lie detection in this way can suffer from false positives 
due to asynchrony between the measurement of running 
processes taken from within the monitored OS and the 
measurement taken from the hypervisor: a new process 
may begin executing and be detected by the hypervisor 
before the OS has had a chance to update the information 
it exports to the agent [10, 13]. To avoid this, Patagonix’s 
agent registers a function with the OS kernel that synchronously 
informs Patagonix of process creation via a 
hypercall. Both Linux and Windows provide facilities 
for this. 


Target OS | Rootkits 
Linux 2.4 | Adore, Adore-ng, Knark, Synapsys 
Linux 2.6 | Adore-ng-2.6, Enyelkm 
Windows XP | Fu, Hacker Defender, Vanquish 

Table 1: Rootkits detected by Patagonix. In reporting 
mode, Patagonix is able to identify processes hidden by 
these rootkits and/or detect tampering of processes by 
these rootkits. In lie detection mode, Patagonix detects 
that the OS is under-reporting the binaries that are running. 


Patagonix’s lie detection detects both OS under- 
reporting (hiding executing binaries) and over-reporting 
(reporting binaries that are not actually executing). Usu- 
ally, rootkits under-report to hide the execution of mali- 
cious binaries, but over-reporting could also be used ma- 
liciously. For example, a rootkit may wish to lead the ad- 
ministrator to believe that a critical program (such as an 
anti-virus scanner) is still running when it is not. Over- 
reporting requires the administrator to specify a thresh- 
old which dictates how long Patagonix will allow a bi- 
nary that is reported as executing by the OS to be not 
observed running any code before declaring it as being 
over-reported. 
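
The comparison performed in lie detection mode can be sketched as follows. The inputs are hypothetical simplifications: `currently_executing` is the set of binaries Patagonix has identified as executing, `os_reported` is the set obtained from the agent, and `last_observed` maps each binary to the time Patagonix last saw its code run.

```python
def detect_lies(currently_executing, os_reported, last_observed, now, threshold):
    """Return (under_reported, over_reported) sets of binary names."""
    # Under-reporting: code is executing but the OS does not list it.
    under = set(currently_executing) - set(os_reported)
    # Over-reporting: the OS lists a binary that has not been observed
    # running any code for longer than the administrator's threshold.
    over = {
        b for b in os_reported
        if now - last_observed.get(b, float("-inf")) > threshold
    }
    return under, over
```

A binary that was never observed at all is treated as over-reported, since its time since last observation is unbounded.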


7 Evaluation 


We evaluate two aspects of Patagonix: its effectiveness 
at detecting and identifying hidden processes and rootk- 
its and the performance overheads introduced by adding 
Patagonix to the hypervisor. 


All experiments were carried out on a machine with 
an AMD Athlon 64 X2 Dual Core 3800+ processor run- 
ning at 2GHz, with 2GB of RAM. We used the Xen 3.0.3 
hypervisor and allocated 512MB of RAM to the mon- 
itored VM and 1GB of RAM to the domain 0 VM, 
which also doubles as the Patagonix VM. Unless stated 
otherwise, the monitored VMs contain either Windows 
XP SP2 or Fedora Core 5 with a 2.6.19 Linux kernel. 


7.1 Effectiveness 


To evaluate the effectiveness of Patagonix at identify- 
ing covertly executing binaries, we used Patagonix to 
monitor VMs containing the nine rootkits listed in Ta- 
ble 1. These rootkits target the Windows kernel and 
Linux kernel versions 2.4 and 2.6. For this experiment, 
they were installed in VMs running Windows XP SP2, 
version 2.4.35.4 of the Linux kernel, and version 2.6.14.7 
of the Linux kernel (the rootkits that targeted Linux 2.6 
kernels did not work with version 2.6.19 of the kernel). 
We evaluated Patagonix in both reporting and lie detec- 
tion mode. 

First, we ran Patagonix on monitored VMs that have 
been infected with the rootkits. Each rootkit (except Van- 
quish) was configured to hide a process on the monitored 
OS: an instance of Freecell on Windows and an instance 
of top on Linux. We then verified that the hidden 
processes were not visible to the standard execution-reporting 
utilities on the respective OSs. In reporting 
mode, Patagonix was able to neutralize all the rootkits 
and report the execution of the covert code to the ad- 
ministrator, as illustrated in Figure 3. Likewise, in lie 
detection mode, Patagonix was able to detect the tampering 
performed by each of the rootkits without fail. The 
Vanquish rootkit does not hide processes like the other 
rootkits. Instead, it tampers with applications by inject- 
ing code into the address space of executing processes. 
In these cases, the executing code of the tampered bi- 
naries is correctly identified as “not present” since it no 
longer matches any binary in the database. This warn- 
ing should be interpreted as a likely rootkit infection by 
the administrator since the only other cause would be a 
missing binary in the trusted database. 

Second, we ran Patagonix on VMs that did not have 
any rootkits installed to see if Patagonix reports any false 
positives. We exercised the VMs using the various application 
benchmarks and microbenchmarks described in the following 
sections. During these tests, all executing code was cor- 
rectly identified. When run in lie detection mode on an 
uninfected VM, Patagonix reported no discrepancies be- 
tween the processes reported by the monitored OS and 
those detected by Patagonix. 


7.2 Microbenchmark 


To understand the overheads introduced by Patagonix, 
we devised chain, a microbenchmark that touches a new 
page of code on every instruction by chaining together a 
series of jumps, each targeting the beginning of the next 
page. Chain represents the worst case scenario for Pa- 
tagonix: every instruction requires Patagonix to identify 
the new page of executable code. We instrumented our 
prototype to break down the page identification process 
into its different components. Figure 4 details the overhead 
incurred when identifying one page of code; the 
values presented are the average of 10,000 Patagonix 
invocations, and the standard deviations for each component 
were consistently less than 5% of the average. 


Figure 3: Output of both Patagonix and the Task Manager when the FU rootkit is used to hide freecell.exe. 
Patagonix identifies all processes including freecell.exe, while the Task Manager does not display the hidden 
process. Patagonix identifies “System” as ntkrnlpa.exe, the name of the Windows XP kernel binary. 


Figure 4: Execution time for various components of the 
identification operation. The total height of the bars represents 
the average time required to identify the origin of 
an executing code page. 
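
The code layout of a chain-style benchmark can be illustrated with the generator below. It is a sketch under assumptions, not the benchmark used here: it emits 32-bit x86 machine code in which each 4 KB page begins with a near jump (opcode E9 with a 32-bit relative displacement) to the start of the next page, and the last page begins with a return. Loading and executing the bytes is left out.

```python
import struct

PAGE_SIZE = 4096


def build_chain(num_pages: int) -> bytes:
    """Return machine code in which every page starts with a jump to the
    beginning of the next page; the final page starts with a return."""
    if num_pages < 1:
        raise ValueError("need at least one page")
    code = bytearray(b"\x90" * (PAGE_SIZE * num_pages))  # pad with NOPs
    for i in range(num_pages - 1):
        base = i * PAGE_SIZE
        # E9 rel32: the displacement is relative to the end of the
        # 5-byte jump instruction, hence PAGE_SIZE - 5.
        code[base:base + 5] = b"\xE9" + struct.pack("<i", PAGE_SIZE - 5)
    code[(num_pages - 1) * PAGE_SIZE] = 0xC3  # ret
    return bytes(code)
```

Because only one instruction executes per page, every instruction lands on a page that has not yet been identified.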


When reaching a new page of code, a page fault is 
triggered by the MMU. This results in an unavoidable 
hardware cost due to the VMexit and VMenter operations 
in and out of the hypervisor. After a VMexit, a 
software page fault handling cost is incurred that is specific 
to Xen’s shadow page table implementation; we expect 
it to change with other hypervisor implementations. 
Patagonix’s hypervisor code is then executed; running 
this code is extremely brief (approximately 0.38), 
attesting to its minimal impact on the hypervisor. This 
code triggers a context switch into the Patagonix VM, 
where a hypercall is executed to retrieve the executing 
page information. These two operations cost a total of 
40µs, but enable 2080 out of a total 3544 lines of code 
to be implemented in the Patagonix VM instead of the 
hypervisor. The hash computation necessary for all oracles 
accounts for 73µs, nearly half of the page identification 
time. As expected, the PE oracle logic takes slightly 
more time than the ELF oracle logic. We note that the 
case in which the PE search function has to match an 
entry-point page against several candidates will be more 
expensive, as each candidate binary requires a hash computation; 
we have observed times as high as 538µs. Fortunately, 
this only happens very rarely and the search is 
only performed once per binary mapped in memory. 


Linux (%) | WinXP (%) | WinXP-hw (%) 
Apache Build 
Boot 30.39 10.63 
SPECINT 2006 
perlbench 201 

Table 2: Application benchmark results. Results are the 
average of ten runs and are given in percent overhead 
over vanilla Xen. All standard deviations were less than 
3% of the mean. WinXP-hw is estimated performance 
with hardware support for sub-page permissions. 


7.3 Application Benchmarks 


Since Patagonix is only invoked when code is executed 
for the first time, we expect this to coincide with page 
faults that load code from the disk. Because disk oper- 
ations are expensive to begin with, we expect Patagonix 
overhead to be minimal in practice. To confirm this, we 
ran several application benchmarks in both the Linux and 
Windows VMs in our prototype. Computationally inten- 
sive applications are represented by the benchmarks from 
the SPECINT 2006 suite. For workloads with larger code 
footprints, we also measured the time Patagonix takes to 
boot Windows and Linux, as well as to build Apache. We 
compare the execution time for each benchmark against 
a vanilla Xen system running the same benchmark on the 
same monitored VM and report the overheads in Table 2. 
Since the PE oracle uses sub-page emulation, we also ran 
benchmarks without the emulation and sub-page checks 
(WinXP-hw column) to approximate what the perfor- 
mance might be if hardware support were available. 

We report the SPECINT benchmarks as an aggregate 
because overheads for all benchmarks were less 
than 3% for the three configurations except for gcc and 
perlbench, whose performance we report separately. 
The Windows boot and gcc have large code footprints 
in comparison to their execution time: Windows initial- 
izes several services, drivers and interrupt handlers dur- 
ing boot, while SPEC drives gcc with a set of tests that 
exercises a large number of code paths. perlbench 
does not experience high overhead except in the WinXP 
configuration because it spends a high portion of its time 
running code on mixed code/data pages, motivating ar- 
chitectural support for sub-pages in such cases. As ex- 
pected, the overhead for all other benchmarks is low. 
This is because their code footprint is small relative to 
their execution time. 

Finally, the Patagonix VM needs to request periodic 
refreshes from the hypervisor. A shorter refresh interval 
means more accurate information about when a process 
was last observed executing, but also incurs more overhead. 
Figure 5 plots the additional overhead the Apache 
build benchmark in Linux experiences for various refresh 
periods, as well as the number of executable pages that 
are invalidated (set non-executable) each time. More frequent 
refreshes mean less time for the application to execute 
various pages, resulting in fewer invalidations. 


Figure 5: Overhead and Invalidations vs. Refresh Period. 
Apache Build on Linux. Averages of five runs with 
standard deviations below 2% of the average. 


8 Related Work 


The problems associated with the semantic gap between 
the hypervisor and guest VMs were first identified in 
a seminal paper by Chen and Noble [6]. Since then, 
there have been several attempts to bridge this gap us- 
ing non-binding information derived from source code 
and symbol information. For example, Livewire [10], 
Copilot [23] and SBCFI [25] rely on symbol information 
in the kernel binary or the System.map file, while Asrigo 
et al. [3] and VMWatcher [13] rely on information 
derived from kernel source code. Because they make as- 
sumptions based on non-binding information, they are all 
prone to evasion by a rootkit that breaks those assump- 
tions. Patagonix does not rely on any non-binding infor- 
mation. 

The principle of lie detection — comparing two views 
of the same data for discrepancies — has been used in the 
literature. For example, Rootkit Revealer [7] and Strider 
GhostBuster [5] compare high-level and low-level views 
of the same system information. However, since both 
views are still derived from within the infected system, 
a thorough rootkit can make both high-level and low- 
level views agree, thus eluding these systems. Like Pa- 
tagonix, other systems compare views taken from both 
within the infected system (i.e., in-the-box) and outside 
it (out-of-the-box). For example, both 
Livewire [10] and VMWatcher [13] compare views of 
executing processes derived from the VMM with those 
gathered from within the monitored system. However, 
unlike Patagonix, these systems do not deal with asynchrony 
between the measurement times of the in-the-box 
and out-of-the-box views and will thus suffer from false 
positives. Lycosid [15] also does lie detection by count- 
ing the number of address spaces in a VM. However, be- 
cause Lycosid does not identify which binaries the pro- 
cesses are executing and the hypervisor’s measurements 
contain noise, it can only probabilistically detect when 
the number of address spaces does not match the num- 
ber of processes reported by the OS. Because Patago- 
nix identifies processes and registers callbacks with the 
OS, Patagonix is able both to precisely detect hidden pro-
cesses and to identify which process is being hidden.


Like Patagonix, remote attestation systems also must 
identify and report executing binaries on a system. In ad- 
dition, they may also report the integrity of the data in a 
system, and are often used to report this information to a 
remote party instead of the system administrator. How-
ever, these systems generally assume a weaker attack
model since they rely on the integrity of the
OS. For example, IMA [29] implements such function-
ality directly in the OS kernel, and thus depends on the
integrity of the OS kernel to report correct results. An al- 
ternative is Terra [9] which performs attestation in a hy- 
pervisor. Terra attests the identity of the virtual disk used 
to initialize a “closed box” to a remote party. Closed 
boxes are VMs that are fully managed by a third party 
and usually cannot be extended in any significant way. 
In contrast, Patagonix allows the monitored OS to be arbitrar-
ily extended as long as the hashes of any new legitimate
code are in the trusted database. A combination of Pata- 
gonix and Terra’s abilities could enable support for attes- 
tation of open, extensible systems as well as individual 
programs executing in these systems. 

Hypervisors have long been used as a means for im- 
plementing a secure trusted computing base, with which 
untrusted OS images could be made secure [16, 31]. 
While our prototype was implemented in the Xen hyper- 
visor [4], the functionality required from the hypervisor 
is generic enough to allow Patagonix to be implemented 
on any virtualization system. To explore this point, we 
have obtained a source code license for VMware Workstation
and are currently working on a port of Patagonix.
We have found that VMware-specific functionality,
such as its page table entry caching [2] and dynamic code
translation [1], has not impeded the necessary functionality
from being added.

Finally, Patagonix uses or extends ideas presented in 
other work. Patagonix is based on our earlier work called 
Manitou, which also uses hashes to identify running ap- 
plications from a hypervisor [18]. However, Manitou is 
only able to identify applications for Linux guest OSs, 
making its treatment of the problem overly simplistic. It 
also does not perform synchronous lie detection. Independently
of our work and using a similar low-level mechanism
to detect code execution, SecVisor [31] restricts
what code can be executed by a modified Linux kernel. 
SecVisor focuses solely on code that is executed in ker- 
nel mode. It uses a custom-made hypervisor, showing 
that execution control can be achieved with a small TCB. 
In contrast, Patagonix provides comprehensive guaran- 
tees for unmodified Linux and Windows OSs as well 
as the applications they execute, and demonstrates that 
these guarantees can be obtained by small extensions to 
a general-purpose hypervisor. Other projects have ma- 
nipulated the page tables used by the X86 MMU. For ex- 
ample, the PaX project [22] proposes manipulating these 
page tables to emulate the NX bit on older CPUs that
do not have hardware support for the feature. Finally,
computer forensics experts [30] have demonstrated that 
PE binaries can be reconstructed by analyzing memory 
dumps. The PE identity oracle described in this paper 
uses similar techniques to identify binaries online. 


9 Conclusions 


Current OSs are vulnerable to subversion by rootkits and
thus cannot be relied upon to provide trustworthy infor- 
mation about what code is executing on a system. Pata- 
gonix solves this problem by using the processor MMU 
to detect executing code from a hypervisor. It then uses 
identity oracles, which leverage information from the bi- 
nary format specifications and loaders to identify the ex- 
ecuting code. In this way, Patagonix is able to bridge the 
semantic gap between the hypervisor and the OS with- 
out having to trust non-binding information, which is 
vulnerable to subversion by the rootkit. We have found 
that binary formats across different OSs have similari- 
ties, enabling the creation of a universal oracle construc- 
tion framework and the use of common techniques across 
various binary formats. Aside from the binary-specific 
formats, the Patagonix framework does not use any in- 
formation about the OS, allowing the same framework to 
be used on diverse OSs such as Windows XP, Linux 2.4 
and Linux 2.6, without any modification. Through the 
combined use of writable and non-executable page table 
bits, Patagonix is only invoked when code is executed for 
the first time, and as a result, has a modest performance 
overhead of less than 3% on most applications. 


Selective Versioning in a Secure Disk System


Swaminathan Sundararaman 
Stony Brook University 


Abstract 


Making vital disk data recoverable even in the event of 
OS compromises has become a necessity, in view of the 
increased prevalence of OS vulnerability exploits over 
the recent years. We present the design and implemen- 
tation of a secure disk system, SVSDS, that performs 
selective, flexible, and transparent versioning of stored 
data, at the disk-level. In addition to versioning, SVSDS 
actively enforces constraints to protect executables and 
system log files. Most existing versioning solutions that 
operate at the disk-level are unaware of the higher-level 
abstractions of data, and hence are not customizable. We 
develop a hybrid solution that combines the advantages
of disk-level and file-system-level versioning systems,
thereby ensuring security while at the same time allow-
ing flexible policies. We implemented and evaluated a 
software-level prototype of SVSDS in the Linux kernel;
our evaluation shows that the space and performance overheads
associated with selective versioning at the disk level are 
minimal. 


1 Introduction 


Protecting disk data against malicious damage is one 
of the key requirements in computer systems security. 
Stored data is one of the most valuable assets for most or-
ganizations and damage to such data often results in ir-
recoverable loss of money and manpower. In today’s
computer systems, vulnerabilities in the OS are not un-
common. OS attacks through rootkits, buffer overflows,
or malware pose a serious threat to critical applications
and data. In spite of this, security policies and mecha- 
nisms are built at the OS level in most of today’s com- 
puter systems. This results in wide-scale system com- 
promise when an OS vulnerability is exploited, making 
the entire disk data open to attack. 

To protect disk data even in the event of OS compro- 
mises, security mechanisms have to exist at a layer below 
the OS, such as the disk firmware. These mechanisms
must not be overridable even by the highest privileged 
OS user, so that even if a malicious attacker gains OS 
root privileges, disk data would be protected. 

Building security mechanisms at the disk-level comes 
with a key problem: traditional disk systems lack higher- 
level semantic knowledge and hence cannot implement 
flexible policies. For example, today’s disk systems can- 
not differentiate between data and meta-data blocks or 
even identify whether a particular disk block is being 
used or is free. Disks have no knowledge of higher-level 
abstractions such as files or directories and hence are 
constrained in providing customized policies. This gen- 
eral problem of lack of information at the lower layers of 
the system is commonly referred to as the “information- 
gap” in the storage stack. Several existing works aim at 
bridging this information-gap [4, 11, 16, 18]. 

In this paper, we present the design and implementa- 
tion of SVSDS, a secure disk system that transparently 
performs selective versioning of key data at the disk- 
level. By preserving older versions of data, SVSDS pro- 
vides a window of time where data damaged by mali- 
cious attacks can be recovered through a secure admin- 
istrative interface. In addition to this, SVSDS enforces 
two key constraints: read-only and append-only, to pro- 
tect executable files and system activity logs which are 
helpful for intrusion detection. 

In SVSDS, we leverage the idea of Type-Safe Disks 
(TSD) [16] to obtain higher-level semantic knowledge at 
the disk-level with minimal modifications to storage soft- 
ware such as file systems. By instrumenting file systems 
to automatically communicate logical block pointers to 
the disk system, a TSD can obtain three key pieces of 
information that are vital for implementing flexible secu- 
rity policies. First, by identifying blocks that have out- 
going pointers, a TSD differentiates between data and 
meta-data. Second, a TSD differentiates between used 
and unused blocks, by just identifying blocks that have 
no incoming pointers (and hence not reachable from any
meta-data block). Third, a TSD knows higher abstrac-
tions such as files and directories by just enumerating 
blocks in a sub-tree of the pointer hierarchy. For exam-
ple, the sub-tree of blocks starting from an inode block
of an Ext2 file system belongs to a collection of files.
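
As a rough illustration of the three inferences above, the following Python sketch derives them from a list of (source, destination) pointer pairs. The function names, the in-memory pointer list, and the `roots` parameter (blocks assumed to be pinned even though nothing points to them, such as a super block) are assumptions made for this example and are not part of the TSD interface.

```python
from collections import defaultdict


def classify_blocks(pointers, all_blocks, roots=frozenset()):
    """Derive block roles from (source, destination) pointer pairs."""
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for src, dst in pointers:
        outgoing[src].add(dst)
        incoming[dst].add(src)
    # A block with at least one outgoing pointer is treated as meta-data.
    meta = {b for b in all_blocks if outgoing[b]}
    # A block with no incoming pointer is unused. Root blocks (assumed here
    # to be pinned by the disk) are the exception.
    unused = {b for b in all_blocks if not incoming[b] and b not in roots}
    return meta, unused


def subtree(pointers, root):
    """Enumerate every block reachable from `root`, including `root`."""
    outgoing = defaultdict(set)
    for src, dst in pointers:
        outgoing[src].add(dst)
    seen, stack = set(), [root]
    while stack:
        block = stack.pop()
        if block not in seen:
            seen.add(block)
            stack.extend(outgoing[block])
    return seen
```

Calling `subtree` with an inode block as the root would yield the blocks of the files described by that inode block, under the assumptions stated above.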

Using this semantic knowledge, SVSDS aggressively 
versions all meta-data blocks, as meta-data impacts the
accessibility of normal data and hence is more impor-
tant. It also provides an interface through which ad-
ministrators can choose specific files or directories for
versioning, or for enforcing operation-based constraints 
(read-only or append-only). SVSDS uses its knowledge 
of free and used blocks to place older versions of meta- 
data and chosen data, and virtualizes the block address- 
space. Older versions of blocks are not accessible to 
higher layers, except through a secure administrative in- 
terface upon authentication using a capability. 

We implemented a prototype of SVSDS in the Linux 
kernel as a pseudo-device driver and evaluated its cor- 
rectness and performance. Our results show that the 
overheads of selective disk-level versioning are quite min-
imal. For a normal user workload, SVSDS had a small
overhead of 1% compared to regular disks. 

The rest of the paper is organized as follows. Sec-
tion 2 describes background. Section 3 discusses the
threat model. Section 4 and Section 5 explain the de- 
sign and implementation of our system respectively. In 
Section 6, we discuss the performance evaluation of our 
prototype implementation. Related work is discussed in 
Section 7 and we conclude in Section 8. 


2 Background 


Data protection has been a major focus of systems re- 
search in the past decade. Inadvertent user errors, ma- 
licious intruders, and malware applications that exploit 
vulnerabilities in operating systems have exacerbated the 
need for stronger data protection mechanisms. In this 
section we first talk about versioning as a means for pro- 
tecting data. We then give a brief description about TSDs 
to make the paper self-contained. 


2.1 Data Versioning 


Versioning data is a widely accepted solution to data pro- 
tection especially for data recovery. Versioning has been 
implemented in different layers. It has been implemented 
above the operating system (in applications), inside the 
operating system (e.g., in file systems) and beneath the 
operating system (e.g., inside the disk firmware). We 
now discuss the advantages and disadvantages of ver- 
sioning at the different layers. 
Application-level versioning. Application-level ver-
sioning is primarily used for source code management [1,
2, 22]. The main advantage of these systems is that they
provide the maximum flexibility as users can control ev- 
erything from choosing the versioning application to cre- 
ating new versions of files. The disadvantage with these 
systems is that they lack transparency and users can eas- 
ily bypass the versioning mechanism. The versioned data 
is typically stored in a remote server and becomes vulner- 
able when the remote server’s OS gets compromised. 


File-system-level versioning. Several file systems 
support versioning [6, 10, 12, 15, 19]. These systems are 
mainly designed to allow users to access and revert
to previous versions of files. The older versions of files
are typically stored under a hidden directory beneath the
parent directory or on a separate partition. As these file
systems maintain older versions of files, they can also be 
used for recovering individual files and directories in the 
event of an intrusion. Unlike application-level versioning 
systems, file-system-level versioning is usually transpar-
ent to higher layers. The main advantage of these ver-
sioning systems is that they can selectively version files
and directories and can also support flexible versioning 
policies (e.g., users can choose different policies for each 
file or directory). Once a file is marked for versioning by 
the user, the file system automatically starts versioning 
the file data. The main problem with file-system-level
versioning is that its security is closely tied to the se-
curity of the operating system. When the operating sys-
tem is compromised, an intruder can bypass the security
checks and change the data stored in the disk. 


Disk-level versioning. The other alternative is to 
version blocks inside the disk [7,20,23]. The main 
advantage of this approach is that the versioning mech- 
anism is totally decoupled from the operating system 
and hence can make data recoverable even when the 
operating system is compromised. The disadvantage 
with block-based disk-level versioning systems is that 
they cannot selectively version files as they lack seman- 
tic information about the data stored inside them. As a 
result, in most cases they end up versioning all the data
inside the disk, which incurs a significant
space overhead for storing versions.


In summary, application-level versioning is weak
in terms of security as it can be easily bypassed by users.
Also, the versioning mechanism is not transparent to
users and can be easily disabled by intruders.
File-system-level data-protection mechanisms provide
transparency and also flexibility in terms of what data
needs to be versioned, but they do not protect the data in
the event of an operating system compromise. Disk-level
versioning systems provide better security than both
application-level and file-system-level versioning, but they do
not provide any flexibility to the users to select the data
that needs to be versioned. What we propose is a hybrid
solution that combines the strong security that
disk-level data versioning provides with the flexibility of
file-system-level versioning systems.


2.2 Type-Safe Disks 


Today’s block-based disks cannot differentiate between 
block types due to the limited expressiveness of the block 
interface. All higher-level operations such as file cre- 
ation, deletion, extension, renaming, etc. are translated 
into a set of block read and write requests. Hence, they 
do not convey any semantic knowledge about the blocks 
they modify. This problem is popularly known as the in-
formation gap in the storage stack [4, 5], and constrains
disk systems with respect to the range of functionality 
that they can provide. 

Pointers are the primary mechanisms by which data is 
organized. Most importantly, pointers define reachability 
of blocks; i.e., a block that is not pointed to by any other 
block cannot be reached or accessed. Almost all popular 
data structures used for storing information use pointers. 
For example, file systems and database systems make ex- 
tensive use of pointers to organize the data stored in the 
disk. Storage mechanisms employed by databases like 
indexes, hashes, lists, and B-trees use pointers to convey
relationships between blocks. 

Pointers are the smallest unit through which file sys- 
tems organize data into semantically meaningful entities 
such as files and directories. Pointers define three things: 
(1) the semantic dependency between blocks; (2) the log- 
ical grouping of blocks; and (3) the importance of blocks. 
Even though pointers provide vast amounts of informa- 
tion about relationships among blocks, today’s disks are 
oblivious to pointers. A Type-Safe Disk (TSD) is a disk 
system that is aware of pointer information and can use 
it to enforce invariants on data access and also perform 
various semantic-aware optimizations which are not pos- 
sible in today’s disk systems. 

TSDs widen the traditional block-based interface to 
enable the software layers to communicate pointer infor- 
mation to the disk. File systems that use TSDs should 
use the disk APIs (CREATE_PTR, DELETE_PTR,
ALLOC_BLOCK, GETFREE) exported by TSDs to allocate
blocks, create and delete pointers, and get free-space in- 
formation from the disk. 

The pointer manager in TSDs keeps track of the re- 
lationship among blocks stored inside the disk. The 
pointer operations supported by TSDs are CREATE_PTR
and DELETE_PTR. Both operations take two arguments:
source and destination block numbers. The pointer
manager uses a P-TABLE (or pointer table) to main-
tain the relationship among blocks inside the disk. En-
tries are added to and deleted from the P-TABLE during
CREATE_PTR and DELETE_PTR operations. When there
are no incoming pointers to a block it is automatically 
garbage collected by the TSD. 

One other important difference between a regular disk 
and a TSD is that file systems no longer perform free-
space management (i.e., file systems no longer need to
maintain bitmaps to manage free space). Free-space
management is moved entirely to the disk. TSDs export the
ALLOC_BLOCK API to allow file systems to request new
blocks from the disk. The ALLOC_BLOCK API takes a
reference block number, a hint block number, and the
number of blocks as arguments and allocates the re-
quested number of file system blocks from the disk-main-
tained free block list. After allocating the new blocks,
the TSD creates pointers from the reference block to each of
the newly allocated blocks. 

The garbage-collection process performed in TSDs is 
different from the traditional garbage-collection mecha-
nism employed in most programming languages. A TSD
reclaims deleted blocks in an online fashion as
opposed to the traditional offline mechanism in most pro- 
gramming languages. TSDs maintain a reference count 
(or the number of incoming pointers) for each block. 
When the reference count of a block decreases to zero, 
the block is garbage-collected; the space is reclaimed by 
the disk and the block is added to the list of free blocks. It 
is important to note that it is the pointer information pro- 
vided by TSD that allows the disk to track the liveness of 
blocks, which cannot be done in traditional disks [17]. 
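
The interplay of the P-TABLE, the pointer operations, ALLOC_BLOCK, and reference-count garbage collection can be sketched in a few lines of Python. This is a minimal in-memory model under stated assumptions, not the TSD implementation: the class and method names are invented for the example, the hint is interpreted as "prefer free blocks at or after this number", and reclaiming a block is assumed to also drop its outgoing pointers so that the blocks it alone referenced are reclaimed in turn.

```python
class TypeSafeDisk:
    """Toy model of a TSD pointer manager (illustrative names only)."""

    def __init__(self, num_blocks, root=0):
        self.root = root                              # assumed pinned root block
        self.free = set(range(num_blocks)) - {root}   # disk-maintained free list
        self.p_table = set()                          # (source, destination) pairs
        self.refcount = {}                            # incoming pointers per block

    def create_ptr(self, src, dst):
        if (src, dst) not in self.p_table:
            self.p_table.add((src, dst))
            self.refcount[dst] = self.refcount.get(dst, 0) + 1

    def delete_ptr(self, src, dst):
        self.p_table.remove((src, dst))
        self.refcount[dst] -= 1
        # A block with no incoming pointers is reclaimed immediately (online).
        if self.refcount[dst] == 0 and dst != self.root:
            self._collect(dst)

    def _collect(self, block):
        del self.refcount[block]
        self.free.add(block)
        # Assumption: reclaiming a block also removes its outgoing pointers.
        for src, dst in [p for p in self.p_table if p[0] == block]:
            self.delete_ptr(src, dst)

    def alloc_blocks(self, ref, hint, count):
        # Assumption: the hint biases allocation toward blocks >= hint.
        candidates = sorted(self.free, key=lambda b: (b < hint, b))[:count]
        if len(candidates) < count:
            raise RuntimeError("not enough free blocks")
        for block in candidates:
            self.free.remove(block)
            self.create_ptr(ref, block)   # pointer from the reference block
        return candidates
```

For example, after allocating blocks 1 and 2 from reference block 0 and block 3 from reference block 1, deleting the pointer from 0 to 1 returns both block 1 and block 3 to the free list in this model.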


3 Threat Model 


Broadly, SVSDS provides a security boundary at the disk 
level and makes vital data recoverable even when an at- 
tacker obtains root privileges. In our threat model, ap- 
plications and the OS are untrusted, and the storage sub- 
system comprising the firmware and magnetic media is 
trusted. The OS communicates with the disk through a 
narrow interface that does not expose the disk’s internal
versioning data. Our model assumes that the disk sys- 
tem is physically secure, and the disk protects against at- 
tackers that compromise a computer system through the 
network. This scenario covers a major class of attacks 
inflicted on computer systems today. 

Specifically, an SVSDS provides the following guar- 
antees: 


• All meta-data and chosen file data marked for pro-
tection will be recoverable to an arbitrary previous
state even if an attacker maliciously deletes or over-
writes the data after compromising the OS. The
depth of history available for recovery is solely de-
pendent on the amount of free space available on
disk. Given the fact that disk space is cheap, this is
an acceptable dependency.


• Data items explicitly marked as read-only are guar-
anteed to be intact against any malicious deletion or
overwriting.


• Data items marked as append-only can never be
deleted or overwritten by any OS attacker. 


It is important to note that SVSDS is designed to pro- 
tect the data stored on the disk and does not provide 
any guarantee on which binaries/files are actually exe- 
cuted by the OS (e.g., rootkits could change the binaries 
in memory). As files with operation-based constraints 
(specifically read-only constraints) cannot be modified 
inside SVSDS, upon a reboot, the system running on 
SVSDS would return to a safe state (provided the system 
executables and configuration files are marked as read- 
only). 


4 Design 


Our aim while designing SVSDS is to combine the se- 
curity of disk-level versioning, with the flexibility of 
versioning at higher-layers such as the file system. By 
transparently versioning data at the disk-level, we make 
data recoverable even in the event of OS compromises. 
However, today’s disks lack information about higher- 
level abstractions of data (such as files and directories), 
and hence cannot support flexible versioning granulari- 
ties. To solve this problem, we leverage Type-Safe Disks 
(TSDs) [16] and exploit higher-level data semantics at 
the disk-level. 

Type-safe disks export an extended block-based in- 
terface to file systems. In addition to the regular 
block read and write primitives exported by traditional 
disks, TSDs support pointer management primitives that 
can be used by file systems to communicate pointer- 
relationships between disk blocks. For example, an Ext2 
file system can communicate the relationships between 
an inode block of a file and its corresponding data blocks. 
Through this, logical abstractions of most file systems 
can be encoded and communicated to the disk system. 
Figure 1 shows the on-disk layout of Ext2. As seen 
from Figure 1, files and directories can be identified us- 
ing pointers by just enumerating blocks of sub-trees with 
inode or directory blocks as root. 

The overall goals of SVSDS are the following: 


• Perform block versioning at the disk-level in a com-
pletely transparent manner such that higher-level
software (such as file systems or user applications)
cannot bypass it. System administrators or users
can set up versioning policies or revert and delete
versions through an offline privileged channel after
a capability-based authentication process enforced
by the disk system.


Figure 1: Pointer relationship inside an FFS-like file system


• Aggressively version all meta-data (e.g., Ext2 inode
blocks) and chosen data as per the policies set up by
administrators or users. From the perspective of a file
system, versioning policies must be at granularities
of individual files or directories.


• Enforce basic constraints at the disk-level, such as
read-only and append-only. Users must be able to 
choose specific files or directories to be protected 
by these constraints. 


Figure 2 shows the overall architecture of SVSDS. The
three major components in SVSDS are (1) the Storage Virtu-
alization Layer (SVL), (2) the Version Manager, and (3)
the Constraint Manager. The SVL virtualizes the block
address space and manages physical space on the device. 
The version manager automatically versions meta-data 
and user-selected files and directories. It also provides 
an interface to revert the disk state to previous ver-
sions. The constraint manager enforces read-only and
append-only operation-level constraints on files and di- 
rectories inside the disk. 

The rest of this section is organized as follows. Sec-
tion 4.1 describes how transparent versioning is per-
formed inside SVSDS. Section 4.2 talks about the ver-
sioning mechanism. Section 4.4 describes our recov-
ery mechanism and how an administrator recovers af-
ter detecting an OS intrusion. Section 4.5 describes how
SVSDS enforces operation-based constraints on files and
directories. Finally, in Section 4.6, we discuss some of
the issues with SVSDS.


Figure 2: Architecture of SVSDS


4.1 Transparent Versioning 


Transparent versioning is an important requirement, as 
SVSDS has to ensure that the versioning mechanism is 
not bypassed by higher layers. To provide transparent 
versioning, the storage virtualization layer (SVL) virtu- 
alizes the disk address space. The SVL splits the disk ad- 
dress space into two: logical and physical, and internally 
maintains the mapping between them. The logical ad- 
dress space is exposed to file systems and the SVL trans- 
lates logical addresses to physical ones for every disk 
request. This enables SVL to transparently change the 
underlying physical block mappings when required, and 
applications are completely oblivious to the exact physi- 
cal location of a logical block. 


SVSDS maintains a T-TABLE (or translation table)
to store the relationship between logical and physical
blocks. There is a one-to-one relationship between each 
logical and physical block in the T-TABLE. A version 
number field is also added to each entry of the T-TABLE to
denote the last version in which a particular block was 
modified. Also, a status flag is added to each T-TABLE 
entry to indicate the type (meta-data or data), and sta- 
tus (versioned or non-versioned) of each block. The T- 
TABLE is indexed by the logical block number and every 
allocated block has an entry in the T-TABLE. When ap- 
plications read (or write) blocks, the SVL looks up the 
T-TABLE for the logical block and redirects the request to 
the corresponding physical block stored in the T-TABLE 
entry. 
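
A minimal Python sketch of this translation step is shown below. The field names, the class names, and the dictionary that stands in for the physical media are assumptions made for illustration; the text specifies only that each T-TABLE entry carries a physical block number, a version number, and type and status flags, and that the table is indexed by logical block number.

```python
from dataclasses import dataclass


@dataclass
class TTableEntry:
    physical: int            # physical block backing this logical block
    version: int             # last version in which the block was modified
    is_meta: bool = False    # type flag: meta-data (True) or data (False)
    versioned: bool = False  # status flag: versioned or non-versioned


class StorageVirtualizationLayer:
    def __init__(self):
        self.t_table = {}  # logical block number -> TTableEntry
        self.media = {}    # physical block number -> bytes (stand-in for the disk)

    def translate(self, logical):
        # Every allocated logical block has exactly one T-TABLE entry.
        return self.t_table[logical].physical

    def read(self, logical):
        return self.media.get(self.translate(logical), b"")

    def write(self, logical, data):
        self.media[self.translate(logical)] = data
```

Because callers only ever name logical blocks, changing `physical` in an entry silently redirects later requests, which is the property the SVL relies on to remap blocks transparently.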


Free-Space Management. SVSDS has two different
address spaces, whereas the regular TSDs only have one. 
Hence, SVSDS cannot reuse the existing block alloca- 
tion mechanism of regular TSDs. To manage both ad- 
dress spaces, the SVL uses two different bitmaps: log- 
ical block bitmaps (LBITMAPS) in addition to the exist- 
ing physical block bitmaps (PBITMAPS). SVSDS uses 
a two-phased block allocation process. During the first 
phase, the SVL allocates the requested number of physi- 
cal blocks from PBITMAPS. The allocation request need 
not always succeed as some of the physical blocks are 
used for storing the previous versions of blocks. If the 
physical block allocation request succeeds, it proceeds 
to the next phase. In the second phase, the SVL allocates 
an equal number of logical blocks from LBITMAPS. It 
then associates each of the newly allocated logical blocks
with a physical block and adds an entry in the T-TABLE
for each pair. The flags for these new entries are copied
from the reference block passed to the ALLOC_BLOCK
call and the version number is copied from the disk-main-
tained version number. This ensures that all blocks that
are added later to a file inherit the same attributes (or 
flags) as their parent block. 
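
The two-phase allocation can be sketched as follows. The function signature, the list-of-booleans bitmaps, and the dictionary layout of T-TABLE entries are hypothetical choices for this example; the sketch also marks the bitmaps only after both phases have found enough blocks, which is a simplification rather than something the text prescribes.

```python
def alloc_blocks(pbitmap, lbitmap, t_table, ref_logical, count, disk_version):
    """Two-phase allocation sketch; returns new logical block numbers or None."""
    # Phase 1: find physical blocks. This can fail because older versions
    # of blocks also occupy physical space.
    free_physical = [i for i, used in enumerate(pbitmap) if not used][:count]
    if len(free_physical) < count:
        return None
    # Phase 2: find an equal number of logical blocks.
    free_logical = [i for i, used in enumerate(lbitmap) if not used][:count]
    if len(free_logical) < count:
        return None
    ref_flags = t_table[ref_logical]["flags"]
    for logical, physical in zip(free_logical, free_physical):
        pbitmap[physical] = True
        lbitmap[logical] = True
        t_table[logical] = {
            "physical": physical,
            "version": disk_version,   # copied from the disk's version number
            "flags": dict(ref_flags),  # inherited from the reference block
        }
    return free_logical
```

Copying the flags from the reference block means that a block added to a versioned file is itself versioned without any further request from the file system.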


4.2 Creating versions 


The version manager is responsible for creating new ver- 
sions and maintaining previous versions of data on the 
disk. The version manager provides the flexibility of file- 
system-level versioning while operating inside the disk. 
By default, it versions all meta-data blocks. In addi- 
tion, it can also selectively version user-selected files and 
directories. The version manager automatically check- 
points the meta-data and chosen data blocks at regular 
intervals of time, and performs copy-on-write upon sub- 
sequent modifications to the data. The version manager 
maintains a global version number and increments it af- 
ter every checkpoint interval. The checkpoint interval is 
the time interval after which the version number is au- 
tomatically incremented by the disk. SVSDS allows an 
administrator to specify the checkpoint interval through 
its administrative interface. 

The version manager maintains a table, V-TABLE (or 
version table), to keep track of previous versions of 
blocks. For each version, the V-TABLE has a separate 
list of logical-to-physical block mappings for modified 
blocks. 

Once the current version is checkpointed, any subse- 
quent write to a versioned block creates a new version for 
that block. During this write, the version manager also 
backs up the existing logical to physical mapping in the 
V-TABLE. To create a new version of a block, the version
manager allocates a new physical block through the SVL,
changes the corresponding logical block entry in the T-TABLE
to point to the newly allocated physical block, and
updates the version number of this entry to the current
version. Figure 3 shows a V-TABLE with a few entries 
in the mapping list for the first three versions. Let’s take 
a simple example to show how entries are added to the 
V-TABLE. If block 3 is overwritten in version 2, the entry 
in the T-TABLE for block 3 is added to the mapping list 
of the previous version (i.e., version 1). 
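
The copy-on-write step can be illustrated with the short Python sketch below. It is one reading of the description, not the SVSDS code: the class layout and the trivial physical-block allocator are hypothetical, the old mapping is filed under the version preceding the current one (following the block 3 example), and a block already modified in the current version is assumed to be overwritten in place.

```python
class VersionManager:
    def __init__(self, first_free_physical=100):
        self.current = 0     # global version number
        self.t_table = {}    # logical -> {"physical", "version", "versioned"}
        self.v_table = {}    # version -> {logical: old physical block}
        self._next_physical = first_free_physical  # hypothetical allocator

    def checkpoint(self):
        # Invoked once per checkpoint interval.
        self.current += 1

    def write(self, logical):
        """Return the physical block that should receive this write."""
        entry = self.t_table[logical]
        if entry["versioned"] and entry["version"] < self.current:
            # First write since the checkpoint: back up the old mapping ...
            saved = self.v_table.setdefault(self.current - 1, {})
            saved[logical] = entry["physical"]
            # ... then remap the logical block to a fresh physical block.
            entry["physical"] = self._next_physical
            self._next_physical += 1
            entry["version"] = self.current
        return entry["physical"]
```

With block 3 last modified in version 1 and mapped to physical block 30, the first write in version 2 records `{3: 30}` in the mapping list of version 1 and redirects the write to a new physical block; later writes in version 2 reuse that new block.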


Versioning TSD Pointer Structures. TSDs maintain
their own pointer structures inside the disk to track block
relationships. The pointer management in TSDs was ex-
plained in Section 2.2. Unless otherwise mentioned, the term
pointers in this paper refers to the disk-level pointers inside
TSDs. As pointers are used to track block live-
ness information inside TSDs, the disk needs to keep its 
pointer structures up to date at all times. When the disk is 
reverted back to the previous version, the pointer opera- 
tions performed in the current version have to be undone 
for the disk to reclaim back the space used by the current 
version. 

To undo the pointer operations, SVSDS logs all 
pointer operations to the pointer operation list of the cur- 
rent version in the V-TABLE. For example, in Figure 3 the 
first entry in the pointer operation list for version 1 shows 
that a pointer was created between logical blocks 3 and 8. 
This create pointer operation has to be undone when the 
disk is reverted back from version 1 to 0. Similarly, the
first entry in the pointer operation list for version 3 de- 
notes that a pointer was deleted between logical blocks 3 
and 8. This operation has to be undone when the disk is 
reverted back from version 3 to version 2. 

To reduce the space required to store the pointer opera- 
tions, SVSDS does not store pointer operations on blocks 
created and deleted (or deleted and created) within the 
same version. Suppose a CREATE_PTR is issued with source
a and destination b in version x. If, during the lifetime of
version x, a DELETE_PTR operation is called with
the same source a and destination b, then the version
manager removes the entry from the pointer operation
list for that version in the V-TABLE. We can safely re-
move these pointer operations because CREATE_PTR and
DELETE_PTR operations are the inverse of each other and
cancel out each other’s changes when they occur with-
in the same version. The recovery manager maintains a
hash table indexed on the source and destination pair for 
efficient retrieval of entries from the V-TABLE. 
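
A small Python sketch of the per-version pointer operation list with this cancellation rule follows. The class and method names are assumptions for the example; a dictionary keyed on the (source, destination) pair plays the role of the hash table mentioned above, and undo operations are returned in reverse order of logging, which is a design choice of this sketch.

```python
class PointerOpLog:
    """Per-version pointer operation lists with same-version cancellation."""

    INVERSE = {"create": "delete", "delete": "create"}

    def __init__(self):
        # version -> {(source, destination): "create" or "delete"}
        self.ops = {}

    def record(self, version, op, src, dst):
        log = self.ops.setdefault(version, {})
        previous = log.get((src, dst))
        if previous is not None and previous != op:
            # A create and a delete on the same pair within one version
            # cancel out, so neither needs to be stored.
            del log[(src, dst)]
        else:
            log[(src, dst)] = op

    def undo_ops(self, version):
        """Operations that revert `version`, newest logged entry first."""
        log = self.ops.get(version, {})
        return [(self.INVERSE[op], src, dst)
                for (src, dst), op in reversed(list(log.items()))]
```

For instance, if version 1 logs the creation of a pointer from block 3 to block 8, reverting from version 1 to version 0 requires a single delete of that pointer, while a pointer created and deleted within version 1 leaves no trace in the log.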


4.3 Selective Versioning 


Current block-based disk systems lack semantic infor- 
mation about the data being stored inside. As a result, 
disk-level versioning systems [7, 23] version all blocks.
But versioning all blocks inside the disk can quickly con-
sume all available free space on the disk. Also, version-
ing all blocks is not efficient for the following two rea-
sons: (1) short-lived temporary data (e.g., data in the /tmp
folder and installation programs) need not be versioned, 
and (2) persistent data blocks have varying levels of im- 
portance. For example, in FFS-like file systems, version- 
ing the super block, inode blocks, or indirect blocks is 
more important than versioning data blocks as the for- 
mer affects the reachability of other blocks stored inside 
the disk. Hence, SVSDS selectively versions meta-data 
and user-selected files and directories to provide deeper 
version histories. 


Versioning meta-data. Meta-data blocks have to be
versioned inside the disk for two reasons. First,
reachability: meta-data blocks affect the reachability of
the data blocks that they point to (e.g., the data blocks
can only be reached through the inode or the indirect
block). Second, recovery of user-selected files: we need to
preserve all versions of the entire file system directory
structure inside the disk to revert back files and
directories.

To selectively version meta-data blocks, SVSDS
uses the pointer information available inside the TSDs.
SVSDS identifies a meta-data block during the first
CREATE_PTR operation in which the block is passed as the
source. For every source block passed to the CREATE_PTR
operation, SVSDS marks it as meta-data in the T-TABLE.

SVSDS defers reallocation of deleted data blocks until 
there are no free blocks available inside the disk. This 
ensures that for a period of time the deleted data blocks 
will still be valid and can be restored back when their 
corresponding meta-data blocks are reverted back during 
recovery. 

To version files and directories, applications issue an 
ioctl to the file system that uses SVSDS. The file system
in turn locates the logical block number of the file’s
inode block, and calls the VERSION_BLOCKS disk primitive.
VERSION_BLOCKS is a new primitive added to the
existing disk interface for applications to communicate 
the files for versioning (see Table 1). After the blocks of 
the file are marked for versioning, the disk automatically 
versions the marked blocks at regular intervals. 


Versioning user-selected data. Versioning meta-data
blocks alone does not make the disk system more secure.
Users still want the disk to automatically version
certain files and directories. To selectively version files
and directories, applications and file systems only have
to pass the starting block (or the root of the subtree)
under which all the blocks need to be versioned. For
example, in Ext2 only the inode block of the file or the
directory needs to be passed for versioning. SVSDS does
a Breadth First Search (BFS) on the P-TABLE, starting
from the root of the subtree. All the blocks traversed
during the BFS are marked for versioning in the T-TABLE.

Figure 3: The V-TABLE data structure. A simplified V-TABLE
state is shown for the first three versions in SVSDS. Each
entry in the old mapping list corresponds to a logical and
physical block pair. C & D in the pointer operation list
represent Create pointer and Delete pointer operations,
respectively.


One common issue in performing BFS is that there 
could potentially be many cycles in the graph that is be- 
ing traversed. For example, in the Ext2TSD [16] file 
system, there is a pointer from the inode of the direc- 
tory block, to the inode of the sub-directory block and 
vice versa. Symbolic links are yet another source of cy- 
cles. SVSDS detects cycles by maintaining a hash table 
(D-TABLE) for blocks that have been visited during the 
BFS. During each stage of the BFS, the version manager 
checks to see if the currently visited node is present in the 
D-TABLE before traversing the blocks pointed to by this 
block. If the block is already present in the D-TABLE, 
SVSDS skips the block as it was already marked for ver- 
sioning. If not, SVSDS adds the currently visited block 
to the D-TABLE before continuing with the BFS. 
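
A minimal sketch of this cycle-safe traversal is shown below. It assumes the P-TABLE can be modeled as a dictionary from a source block to the list of blocks it points to, and the T-TABLE flags as a dictionary of sets; the function name mark_subtree and the flag string are hypothetical.

```python
from collections import deque


def mark_subtree(p_table, root, t_flags, flag="versioned"):
    """Mark every block reachable from root, visiting each block once."""
    d_table = set()            # blocks already visited (the D-TABLE)
    queue = deque([root])
    while queue:
        block = queue.popleft()
        if block in d_table:
            continue           # already marked: this edge closes a cycle
        d_table.add(block)
        t_flags.setdefault(block, set()).add(flag)
        queue.extend(p_table.get(block, ()))
    return d_table


# Directory inode 1 and sub-directory inode 2 point at each other.
p_table = {1: [2, 3], 2: [1, 4], 3: [], 4: [2]}
flags = {}
visited = mark_subtree(p_table, root=1, t_flags=flags)
print(sorted(visited))
```

Because membership in the visited set is checked before a block's outgoing pointers are expanded, the traversal terminates even when parent and child directories reference each other.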


To identify blocks that are subsequently added to
versioned files or directories, SVSDS checks the flags
present in the T-TABLE of the source block during the
CREATE_PTR operations. This is because when file systems
want to get a free block from SVSDS, they issue an
ALLOC_BLOCK call with a reference block and the number of
required blocks as arguments. This ALLOC_BLOCK call is
internally translated to a CREATE_PTR operation with the
reference block and the newly allocated block as its
arguments. If the reference block is marked to be
versioned, then the destination block that it points to is
also marked for versioning. File systems normally pass the
inode or the indirect block as the reference block.
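
The two marking rules described in this section (a pointer source is treated as meta-data, and a newly allocated block inherits the versioning flag of its reference block) can be sketched together. The class TinyDisk, its method names, and the flag strings are hypothetical stand-ins for the T-TABLE and P-TABLE bookkeeping, not the prototype's code.

```python
class TinyDisk:
    """Toy model of the T-TABLE flags and P-TABLE pointers."""

    def __init__(self):
        self.t_flags = {}    # logical block -> set of flags
        self.p_table = {}    # source block -> list of destination blocks
        self._next = 0

    def new_block(self):
        """Hand out an unused logical block number."""
        block = self._next
        self._next += 1
        self.t_flags[block] = set()
        return block

    def create_ptr(self, src, dst):
        self.p_table.setdefault(src, []).append(dst)
        src_flags = self.t_flags.setdefault(src, set())
        src_flags.add("meta")              # any pointer source is meta-data
        if "versioned" in src_flags:       # destination inherits versioning
            self.t_flags.setdefault(dst, set()).add("versioned")

    def alloc_block(self, ref, count=1):
        """ALLOC_BLOCK: allocate blocks and link them from the reference."""
        blocks = []
        for _ in range(count):
            block = self.new_block()
            self.create_ptr(ref, block)
            blocks.append(block)
        return blocks


disk = TinyDisk()
inode = disk.new_block()
disk.t_flags[inode].add("versioned")       # file was marked for versioning
data = disk.alloc_block(inode, count=2)
print([sorted(disk.t_flags[b]) for b in data])
```

Blocks allocated under a reference block that is not versioned stay unmarked, which is what keeps versioning selective.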


4.4 Reverting Versions


In the event of an intrusion or an operating system 
compromise, an administrator would want to undo the 
changes done by an intruder or a malicious application 
by reverting back to a previous safe state of the disk. We
define reverting back to a previous version as restoring
the disk state from time t to the disk state at time t - tv,
where tv is the checkpoint interval.

Even though SVSDS can access any previous ver- 
sion’s data, we require reverting only one version at a 
time. This is because SVSDS internally maintains state 
about block relationships through pointers, and it re- 
quires that the pointer information be properly updated 
inside the disk to garbage-collect deleted blocks. To il- 
lustrate the problem with reverting back to an arbitrary 
version, let’s revert the disk state from version f to ver- 
sion a by skipping reverting of the versions between f 
and a. Reverting back the V-TABLE entries for version 
a alone would not suffice. As we directly jump to ver- 
sion a, the blocks that were allocated, and pointers that 
were created or deleted between versions f and a, are 
not reverted back. The blocks present during version a
do not contain information about blocks created after
version a. As a result, blocks allocated after version a
become unreachable by applications, but according to the
pointer information in the P-TABLE they are still
reachable. Consequently, the disk will not reclaim these
blocks and disk space will be leaked. Hence,
SVSDS allows an administrator to revert back only one 
version at a time. 

SVSDS also allows an administrator to revert back
the disk state to an arbitrary point in time by reverting
back one version at a time until the largest version whose
start time is less than or equal to the time mentioned by
the administrator is found. REVERT_TO_PREVIOUS_VERSION
and REVERT_TO_TIME are the additional primitives added to
the existing disk interface to revert back versions by the
administrator (see Table 1).

Disk Primitive                Description
VERSION_BLOCKS(BNo)           Marks all blocks in the subtree starting
                              from block BNo to be versioned. The data
                              blocks present in the subtree will be
                              versioned along with the reference (or
                              meta-data) blocks.
REVERT_TO_PREVIOUS_VERSION    Reverts back the disk state from current
                              version to the previous version.
REVERT_TO_TIME(t)             Reverts back the disk state one version
                              at a time till it finds a version v with
                              start time less than or equal to t.
MARK_READ_ONLY(BNo)           Marks all blocks in the sub-tree starting
                              from block BNo as read-only.
MARK_APPEND_ONLY(BNo)         Marks all blocks in the sub-tree starting
                              from block BNo as append-only. BNo itself
                              will not be an append-only block as it
                              could be a meta-data block, with
                              non-sequential updates.

Table 1: Additional Disk APIs in SVSDS
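
The REVERT_TO_TIME behaviour described above amounts to a loop over single-version reverts. The sketch below assumes the disk keeps the start times of its versions in increasing order; the function name, the list representation, and the revert_one callback are hypothetical.

```python
def revert_to_time(start_times, t, revert_one):
    """Revert one version at a time until the current version started at or
    before t. start_times is ordered oldest first; the last entry is the
    current version. Returns the index of the version that ends up current."""
    while len(start_times) > 1 and start_times[-1] > t:
        revert_one()          # stands in for REVERT_TO_PREVIOUS_VERSION
        start_times.pop()
    return len(start_times) - 1


calls = []
versions = [0, 30, 60, 90]    # hypothetical start times, in seconds
current = revert_to_time(versions, 45, lambda: calls.append("revert"))
print(current, len(calls))
```

Each iteration performs a complete single-version revert, so the pointer and bitmap state is brought up to date at every step and no intermediate version is skipped.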

While reverting back to a previous version, SVSDS
recovers the data by reverting back the following: (1)
Pointers: the pointer operations that happened in the
current version are reverted back; (2) Meta-data: all
meta-data changes that happened in the current version are
reverted back; (3) Data blocks: all versioned data blocks
and some (or all) of the non-versioned deleted data
blocks are reverted back (i.e., the non-versioned data
blocks that have been garbage collected cannot be
reverted back); and (4) Bitmaps: both logical and physical
block bitmap changes that happened during the current
version are reverted.


4.4.1 Reverting Mapping 


SVSDS reverts back to its previous version from the cur- 
rent version in two phases. In the first phase, it restores 
all the T-TABLE entries stored in the mapping list of the 
previous version in the V-TABLE. While restoring back 
the T-TABLE entries of the previous version, there are two 
cases that need to be handled. (1) An entry already ex- 
ists in the T-TABLE for the logical block of the restored 
mapping. (2) An entry does not exist. When an entry 
exists in the T-TABLE, the current mapping is replaced 
with the old physical block from the mapping list in the 
V-TABLE. The current physical block is freed by clearing 
the bit corresponding to the physical block number in the 
PBITMAPS. If an entry does not exist in the T-TABLE, it 
implies that the block was deleted in the current version 
and the mapping was backed up in the V-TABLE. SVSDS 
restores the mapping as a new entry in the T-TABLE and 
the logical block is marked as used in the LBITMAPS. 
The physical block need not be marked as used as it is 
already alive. At the end of the first phase, SVSDS re- 
stores back all the versioned data that got modified or 
deleted in the current version. 
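
The two cases of the first phase can be sketched as follows. This is an illustration under simplifying assumptions: the T-TABLE is a dictionary from logical to physical block, and each bitmap is modeled as a set of allocated block numbers. The function and parameter names are hypothetical.

```python
def revert_mappings(t_table, old_mappings, pbitmap, lbitmap):
    """Phase one: restore the (logical, physical) pairs backed up in the
    previous version's mapping list."""
    for logical, old_physical in old_mappings:
        if logical in t_table:
            # Case 1: the block was modified in the current version.
            # Free the newer physical block, then point back at the old one.
            pbitmap.discard(t_table[logical])
        else:
            # Case 2: the block was deleted in the current version.
            # Re-allocate the logical block; the old physical block was
            # never freed, so the physical bitmap is left alone.
            lbitmap.add(logical)
        t_table[logical] = old_physical


t_table = {1: 10, 2: 21}              # block 2 was copied to 21 on write
pbitmap = {10, 20, 21, 30}            # 20 and 30 hold backed-up contents
lbitmap = {1, 2}                      # logical block 3 was deleted
revert_mappings(t_table, [(2, 20), (3, 30)], pbitmap, lbitmap)
print(t_table, sorted(pbitmap), sorted(lbitmap))
```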


4.4.2 Reverting Pointer Operations


In the second phase of the recovery process, SVSDS
reverts back the pointer operations performed in the
current version by applying the inverse of the pointer
operations. The inverse of the CREATE_PTR operation is a
DELETE_PTR operation and vice versa. The pointer
operations are reverted back to free up the space used by
blocks created in the current version and also for
restoring pointers deleted in the current version.

Reverting back CREATE_PTR operations is
straightforward. SVSDS issues the corresponding DELETE_PTR
operations. If there are no incoming pointers to the
destination blocks of the DELETE_PTR operations, the disk
automatically garbage collects the destination blocks.

While reverting the DELETE_PTR operations, SVSDS
checks if the destination blocks are present in the
T-TABLE. If yes, SVSDS executes the corresponding
CREATE_PTR operations. If the destination blocks are not
present in the T-TABLE, it implies that the DELETE_PTR
operations were performed on non-versioned blocks. If
the destination blocks are present in the deleted block
list, SVSDS restores the backed up T-TABLE entries from
the deleted block list and issues the corresponding
CREATE_PTR operations.

Figure 4: Steps in reverting back delete pointer operations

While reverting back to a previous version, the inverse
pointer operations have to be replayed in the reverse
order. If not, SVSDS would prematurely garbage collect
these blocks. We illustrate this problem with a simple
example. From Figure 4(a) we can see that block a has
a pointer to block b and block b has pointers to blocks c
and d. The pointers from b are first deleted and then the
pointer from a to b is deleted. This is shown in Figs. 4(b)
and 4(c). If the inverse pointer operations are applied in
the same order, first a pointer would be created from
block b to d (assuming the pointer from b to d is deleted
first), but block b would be automatically garbage
collected by SVSDS as there are no incoming pointers to
block b. Replaying pointer operations in the reverse order
avoids this problem. Figs. 4(d), 4(e), and 4(f) show the
sequence of steps performed while reverting back the
delete pointer operations in the reverse order. We can see
that reverting back pointer operations in the reverse
order reestablishes the pointers in the correct sequence.
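
The ordering problem can be reproduced with a toy model. The sketch below is an assumption-laden simplification, not the SVSDS garbage collector: a non-root block whose pointer is created while it has no incoming pointers is collected immediately, and collection cascades to blocks it alone pointed to. The class TinyTSD and the function undo_deletes are hypothetical names.

```python
class TinyTSD:
    """Toy pointer store with eager garbage collection."""

    def __init__(self, roots, deleted):
        self.roots = set(roots)
        self.deleted = set(deleted)   # recoverable deleted block list
        self.incoming = {}            # block -> set of source blocks
        self.lost = set()             # blocks garbage collected for good

    def _collect(self, block):
        self.lost.add(block)
        self.deleted.discard(block)
        self.incoming.pop(block, None)
        for dst, srcs in list(self.incoming.items()):
            if block in srcs:
                srcs.discard(block)
                if not srcs:          # dst lost its last incoming pointer
                    self._collect(dst)

    def create_ptr(self, src, dst):
        if src in self.lost or dst in self.lost:
            return False              # nothing left to restore
        self.deleted.discard(dst)     # restore dst from the deleted list
        self.incoming.setdefault(dst, set()).add(src)
        if src not in self.roots and not self.incoming.get(src):
            self._collect(src)        # src is unreachable: collected
            return False
        return True


def undo_deletes(disk, delete_log, reverse=True):
    """Replay the inverse (create) of each logged delete operation."""
    ops = reversed(delete_log) if reverse else delete_log
    return [disk.create_ptr(src, dst) for src, dst in ops]


# Deletes were logged in this order: b->d, b->c, then a->b.
log = [("b", "d"), ("b", "c"), ("a", "b")]
same_order = TinyTSD(roots={"a"}, deleted={"b", "c", "d"})
undo_deletes(same_order, log, reverse=False)
reverse_order = TinyTSD(roots={"a"}, deleted={"b", "c", "d"})
undo_deletes(reverse_order, log)
print(sorted(same_order.lost), sorted(reverse_order.lost))
```

Replaying in logged order loses block b (and d with it) before the pointer from a to b is restored, whereas the reverse order restores a to b first and loses nothing.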


4.4.3 Reverting Meta-Data 


SVSDS uses the mapping information in the V-TABLE to 
revert back changes to the meta-data blocks. There are 
three cases that need to be handled while reverting back 
meta-data blocks: (1) The meta-data block is modified 
in the new version, (2) The meta-data block is deleted 
in the new version, and (3) The meta-data block is first 
modified and then deleted in the new version. In the first 
case, the mappings that are backed up in the previous 
version for the modified block in the V-TABLE are re- 
stored. This is done to get back the previous contents 
of the meta-data blocks. For the second case, the delete 
pointer operations would have caused the T-TABLE en- 
tries to be backed up in the V-TABLE as they would be 
the last incoming pointer to the meta-data blocks. The 
T-TABLE entries will be restored back in the first phase 
of the recovery process and the deleted pointers are re- 
stored back in the second phase of the recovery process. 
Reverting meta-data blocks when they are first modified 
and then deleted is the same as in reverting meta-data 
blocks when they are deleted. 


4.4.4 Reverting Data Blocks 


When the recovery manager reverts back to a previous 
version, it cannot revert back to the exact disk state in 
most cases. To revert back to the exact disk state, the disk 
would need to revert mappings for all blocks, including 
the data blocks that are not versioned by default. In a 
typical TSD scenario, blocks are automatically garbage 
collected as soon as the last incoming pointer to them 
is deleted, making their recovery difficult if not impos- 
sible. The garbage collector in SVSDS tries to reclaim 
the deleted data blocks as late as possible. To do this, 
SVSDS maintains an LRU list of deleted non-versioned 
blocks (also known as the deleted block list). 

When the delete-pointer operations are reverted back,
SVSDS issues the corresponding create-pointer operations
only if the deleted data blocks are still present in
the deleted block list. This policy of lazy garbage
collection allows users to recover the deleted data blocks
that have not yet been garbage collected.

Lazy garbage collection is also useful when a user re- 
verts back the disk state after inadvertently deleting a di- 
rectory. If all data blocks that belong to the directory are 
not garbage collected, then the user can get back the en- 
tire directory along with the files stored under it. If some 
of the blocks are already reclaimed by the disk, the user 
would get back the deleted directory with data missing in 
some files. Even though SVSDS does not version all data
blocks, it still tries to restore back all deleted data
blocks when the disk is reverted back to its previous
version.
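
A minimal sketch of such a deleted block list is given below, using an ordered dictionary so that the oldest deletion is reclaimed first. The class and method names are hypothetical, and the eviction order is an assumption about what "LRU" means here.

```python
from collections import OrderedDict


class DeletedBlockList:
    """Deleted non-versioned blocks, kept until space is actually needed."""

    def __init__(self):
        self._blocks = OrderedDict()   # logical -> physical, oldest first

    def add(self, logical, physical):
        """Record a block whose last incoming pointer was just deleted."""
        self._blocks[logical] = physical
        self._blocks.move_to_end(logical)

    def reclaim_one(self):
        """Give up the oldest deleted block; call only when the disk has no
        free blocks left. Returns the freed physical block, or None."""
        if not self._blocks:
            return None
        _, physical = self._blocks.popitem(last=False)
        return physical

    def restore(self, logical):
        """Return the backed-up physical block during a revert, or None if
        the block has already been reclaimed."""
        return self._blocks.pop(logical, None)


deleted = DeletedBlockList()
for logical, physical in [(1, 100), (2, 200), (3, 300)]:
    deleted.add(logical, physical)
deleted.reclaim_one()          # disk ran out of space: block 1 is gone
print(deleted.restore(1), deleted.restore(2))
```

A revert that finds its block still in the list gets the data back intact; one that arrives after reclamation gets nothing, which corresponds to the "directory with data missing in some files" case above.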


4.4.5 Reverting Bitmaps 


When data blocks are added or reclaimed back during 
the recovery process the bitmaps have to be adjusted to 
keep track of free blocks. The PBITMAPS need not be 
restored back as they are never deleted. The physical 
blocks are backed up either in the deleted block list or 
in the old mapping lists in the V-TABLE. The physical 
blocks that are added in the current version are freed dur- 
ing the first and second phases of the recovery process. 
During the first phase, the previous version’s data is
restored from the mapping list in the V-TABLE. At this time the
physical blocks of the newer version are marked free in 
the PBITMAPS. When the pointers created in the current 
version are reverted back by deleting them in the second 
phase, the garbage collector frees both the physical and 
the logical blocks, only if it is the last incoming pointer 
to the destination block. 

The LBITMAPS only have to be restored back for ver- 
sioned blocks that have been deleted in the current ver- 
sion. While restoring the backed up mappings from the 
V-TABLE, SVSDS checks if the logical block is allocated 
in the LBITMAPS. If it is not allocated, SVSDS reallo- 
cates the deleted logical block by setting the correspond- 
ing bit in the LBITMAPS. The deleted non-versioned 
blocks need not be restored back. Previously, these 
blocks were moved to the deleted block list and were 
added back to the T-TABLE during the second phase of 
the recovery process. 


4.5 Operation-based constraints


In addition to versioning data inside the disk, it is also 
important to protect certain blocks from being modified, 
overwritten, or deleted. SVSDS allows users to spec- 
ify the types of operations that can be performed on a 
block, and the constraint manager enforces these con- 
straints during block writes. SVSDS enforces two types 
of operation-based constraints: read-only and append- 
only. 

The sequence of steps taken by the operation manager
to mark a file as read-only or append-only is the
same as marking a file to be versioned. The steps for
marking a file to be versioned were described in
Section 4.3. While marking a group of blocks, the first
block (or the root block of the subtree) encountered in
the breadth first search is treated differently to
accommodate special file system updates. For example, file
systems under UNIX support three timestamps: access
time (atime), modification time (mtime), and change
time (ctime). When data from a file is read, its atime
is updated in the file’s inode. Similarly, when the file
is modified, its mtime and ctime are updated in its
inode. To accommodate atime, mtime, and ctime updates
on the first block, the constraint manager distinguishes
the first block by adding a special meta-data block flag
in the T-TABLE for the block. SVSDS disallows deletion
of blocks marked as read-only or append-only.
MARK_READ_ONLY and MARK_APPEND_ONLY are the two new APIs
that have been added to the disk for applications to
specify the operation-based constraints on blocks stored
inside the disk. These APIs are described in Table 1.


Read-only constraint. The read-only operation-based 
constraint is implemented to make block(s) immutable. 
For example, the system administrator could mark bi- 
naries or directories that contain libraries as read-only, 
so that later on they are not modified by an intruder or 
any other malware application. Since SVSDS does not 
have information about the file system data structures, 
atime updates cannot be distinguished from regular block 
writes using pointer information. SVSDS neglects (or 
disallows) the atime updates on read-only blocks, as they 
do not change the integrity of the file. Note that the read- 
only constraint can also be applied to files that are rarely 
updated (such as binaries). When such files have to be 
updated, the read-only constraint can be removed and set 
back again by the administrator through the secure disk 
interface. 


Append-only constraint. Log files serve as an important
resource for intrusion analysis and statistics collection.
The results of the intrusion analysis are heavily
dependent on the integrity of the log files. The
operation-based constraints implemented by SVSDS can be
used to protect log files from being overwritten or deleted
by intruders.

SVSDS allows marking any subtree in the pointer
chain as “append-only”. During a write to a block in
an append-only subtree, the operation manager allows
it only if the modification changes trailing zeroes
to non-zero values. SVSDS checks the difference between
the original and the new contents to verify that
data is only being appended, and not overwritten. To
improve performance, the operation manager caches
the append-only blocks when they are written to the disk
to avoid reading the original contents of the block from
the disk during comparison. If a block is not present in
the cache, the constraint manager reads the block and adds
it to the cache before processing the write request. To
speed up comparisons, the operation manager also stores
the offset of the end of data inside each append-only
block. The newly written data is compared with the cached
data up to the stored offset.
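
The append-only check can be sketched as a comparison of the old and new block contents up to the end of the existing data. The function name and the optional end_offset parameter (standing in for the cached end-of-data offset) are hypothetical; when no offset is cached, the sketch derives it by stripping trailing zero bytes.

```python
def is_append_only_write(old: bytes, new: bytes, end_offset=None) -> bool:
    """Return True if new only fills in trailing zeroes of old."""
    if len(old) != len(new):
        return False                    # block writes keep the block size
    if end_offset is None:
        # No cached offset: existing data ends where trailing zeroes begin.
        end_offset = len(old.rstrip(b"\x00"))
    # Everything before the end of the existing data must be unchanged.
    return new[:end_offset] == old[:end_offset]


BLOCK = 16                              # tiny block size for illustration
old = b"log1\n".ljust(BLOCK, b"\x00")
appended = b"log1\nlog2\n".ljust(BLOCK, b"\x00")
tampered = b"LOG1\nlog2\n".ljust(BLOCK, b"\x00")
print(is_append_only_write(old, appended),
      is_append_only_write(old, tampered))
```

Passing the cached offset avoids rescanning the block for its trailing zeroes on every write, which is the purpose of storing the end-of-data offset alongside the cached block.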

When data is appended to the log file, the atime and 
the mtime are also updated in the inode block of the file 
by the file system. As a result, the first block of the 
append-only file is overwritten with every update to
the file. As mentioned earlier, SVSDS does not have the 
information about the file system data structures. Hence, 
SVSDS permits the first block of the append-only files to 
be overwritten by the file system. 

SVSDS does not have information about how file
systems organize their directory data. Hence, enforcing
append-only constraints on directories will work only if
the new directory entries are added after the existing
entries. This also ensures that files in directories marked
as append-only cannot be deleted. This would help in
preventing malicious users from deleting a file and
creating a symlink to a new file (for example, an attacker
can no longer unlink a critical file like /etc/passwd, and
then just create a new file in its place).


4.6 Issues 


In this section, we talk about some of the issues with 
SVSDS. First we talk about the file system consistency 
after reverting back to a previous version inside the disk. 
We then talk about the need for a special port on the disk 
to provide secure communication. Finally, we talk about 
Denial of Service (DoS) attacks and possible solutions to 
overcome them. 


Consistency. Although TSDs understand a limited
amount of file system semantics through pointers, they
are still oblivious to the exact format of file
system-specific meta-data and hence cannot revert to a
state that is consistent from the viewpoint of specific
file systems. A
file system consistency checker (e.g., fsck) needs to be 
run after the disk is reverted back to a previous version. 
Since SVSDS internally uses pointers to track blocks, the 
consistency checker should also issue appropriate calls to 
SVSDS to ensure that disk-level pointers are consistent 
with file system pointers. 


Administrative Interfaces. To prevent unauthorized
users from reverting versions inside the disk, SVSDS 
should have a special hardware interface through which 
an administrator can log in and revert back versions. 
This port can also be used for setting the checkpoint fre- 
quency. 


Supporting Encryption File Systems. Encryption file
systems (EFS) can run on top of SVSDS with minimal 
modifications. SVSDS only requires EFS to use TSD’s 
API for block allocation and notifying pointer relation- 
ship to the disk. The append-only operation-based con- 
straint would not work for EFS as end of block can- 
not be detected if blocks are encrypted. If encryption 
keys are changed across versions and if the administra- 
tor reverts back to a previous version, the decryption of 
the file would no longer work. One possible solution is 
to change the encryption keys of files after a capability 
based authentication upon which SVSDS would decrypt 
all the older versions and re-encrypt them with the newly 
provided keys. The disadvantage with this approach is 
that the versioned blocks need to be decrypted and re- 
encrypted when the keys are changed. 


DoS Attacks. SVSDS is vulnerable to denial of service
attacks. There are three issues to be handled: (1) blocks
that are marked for versioning could be repeatedly
overwritten; (2) lots of bogus files could be created to
delete old versions; and (3) versioned files could be
deleted and recreated again, preventing subsequent
modifications to files from being versioned inside the
disk. To counter attacks of type 1, SVSDS can throttle
writes to files that are versioned very frequently. An
alternative solution to this problem would be to
exponentially increase the versioning interval of the
particular file or directory that is being constantly
overwritten, resulting in fewer versions for the file. As
with most denial of service attacks, there is no perfect
solution to attacks of type 2. One possible solution would
be to stop further writes to the disk until some of the
space used up by older versions is freed up by the
administrator through the administrative interface. The
downside of this approach is that the disk effectively
becomes read-only till the administrator frees up some
space. Type 3 attacks are not that serious, as versioned
files are always backed up when they are deleted. One
possible solution to prevent versioned files from being
deleted is to add a no-delete flag on the inode block of
the file. This flag would be checked by SVSDS along with
other operation-based constraints before deleting or
modifying the block. The downside of this approach is that
normal users can no longer delete versioned files that
have been marked as no-delete. The administrator has to
explicitly delete this flag on the no-delete files.


5 Implementation 


We implemented a prototype SVSDS as a pseudo-device 
driver in Linux kernel 2.6.15 that stacks on top of an 
existing disk block driver. Figure 5 shows the pseudo 
device driver implementation of SVSDS. SVSDS has 
7,487 lines of kernel code out of which 3,060 were 
reused from an existing TSD prototype. The SVSDS 
layer receives all block requests from the file system, 
and re-maps and redirects the common read and write 
requests to the lower-level device driver. The additional 
primitives required for operations such as block alloca- 
tion and pointer management are implemented as driver 
ioctls. 

Figure 5: Prototype Implementation of SVSDS


In the current implementation we maintain all hash ta- 
bles (V-TABLE, T-TABLE, P-TABLE, and D-TABLE) as in- 
memory data structures. As these hash tables only have 
small space requirements, they can be persistently stored 
in a portion of the NVRAM inside the disk. This helps 
SVSDS to avoid disk I/O for reading these tables. 

The read and write requests from file systems reach
SVSDS through the Block IO (BIO) layer in the Linux
kernel. The BIO layer issues I/O requests with the
destination block number, callback function (BI_END_IO),
and the buffers for data transfer, embedded inside the
BIO data structure. To redirect the block requests from
SVSDS to the underlying disk, we add a new data structure
(BACKUP_BIO_DATA). This structure stores the destination
block number, BI_END_IO, and BI_PRIVATE of the BIO data
structure. The BI_PRIVATE field is used by the owner of
the BIO request to store private information. As I/O
requests are by default asynchronous in the Linux kernel,
we store the original contents of the BIO data structure
by making BI_PRIVATE point to our BACKUP_BIO_DATA
data structure. When I/O requests reach SVSDS, we
replace the destination block number, BI_END_IO, and
BI_PRIVATE in the BIO data structure with the mapped
physical block from the T-TABLE, our callback function
(SVSDS_END_IO), and the BACKUP_BIO_DATA, respectively.
Once the I/O request is completed, control reaches our
SVSDS_END_IO function. In this function, we restore the
original block number and BI_PRIVATE information from the
BACKUP_BIO_DATA data structure. We then call the
BI_END_IO function stored in the BACKUP_BIO_DATA data
structure, to notify the BIO layer that the I/O request is
now complete.
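
The save, redirect, and restore sequence can be modeled outside the kernel. The sketch below is a user-space Python analogy rather than kernel code: the Bio and BackupBioData classes, their field names, and the lower_submit callback are hypothetical stand-ins for the kernel's BIO structure and the lower-level driver.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Bio:
    """Stand-in for a block I/O request."""
    sector: int                          # destination block number
    end_io: Callable[["Bio"], None]      # completion callback
    private: Any = None                  # owner's private data


@dataclass
class BackupBioData:
    """Original request fields, saved while the request is redirected."""
    sector: int
    end_io: Callable[[Bio], None]
    private: Any


def svsds_end_io(bio: Bio) -> None:
    """Completion hook: restore the original fields, then notify the owner."""
    backup = bio.private
    bio.sector = backup.sector
    bio.private = backup.private
    backup.end_io(bio)


def svsds_submit(bio: Bio, t_table: dict, lower_submit) -> None:
    """Remap the request to its physical block and pass it down."""
    bio.private = BackupBioData(bio.sector, bio.end_io, bio.private)
    bio.sector = t_table[bio.sector]     # logical -> physical
    bio.end_io = svsds_end_io
    lower_submit(bio)


completed, issued = [], []


def fs_done(bio):                        # the file system's own callback
    completed.append((bio.sector, bio.private))


def fake_disk(bio):                      # completes the request immediately
    issued.append(bio.sector)
    bio.end_io(bio)


svsds_submit(Bio(7, fs_done, "fs-cookie"), {7: 42}, fake_disk)
print(issued, completed)
```

The lower layer only ever sees the physical block number, while the request's owner sees its original block number and private data again by the time its callback runs.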

We did not make any design changes to the existing
Ext2TSD file system to support SVSDS. Ext2TSD is a
modified version of the Ext2 file system that notifies
the disk of pointer relationships through the TSD disk
APIs. To enable users to select files and directories for
versioning or enforcing operation-based constraints, we
have added three ioctls, namely VERSION_FILE,
MARK_FILE_READONLY, and MARK_FILE_APPENDONLY, to the
Ext2TSD file system. All three ioctls take a file
descriptor as their argument, and get the inode number
from the in-memory inode data structure. Once the Ext2TSD
file system has the inode number of the file, it finds the
logical block number that corresponds to the inode number
of the file. Finally, we call the corresponding disk
primitive from the file system ioctl with the logical block
number of the inode as the argument. Inside the disk
primitive we mark the file’s blocks for versioning or
enforcing operation-based constraints by performing a
breadth first search on the P-TABLE.


6 Evaluation 


We evaluated the performance of our prototype SVSDS
using the Ext2TSD file system [16]. We ran general-purpose
workloads on our prototype and compared them
with an unmodified Ext2 file system on a regular disk. This
section is organized as follows: In Section 6.1, we talk
about our test platform, configurations, and procedures.
Section 6.2 analyzes the performance of the SVSDS
framework for an I/O-intensive workload, Postmark [8].
In Sections 6.3 and 6.4 we analyze the performance on
OpenSSH and kernel compile workloads, respectively.


6.1 Test infrastructure 


We conducted all tests on a 2.8GHz Intel Xeon CPU with 
1GB RAM, and a 74GB 10Krpm Ultra-320 SCSI disk. 
We used Fedora Core 6 running a vanilla Linux 2.6.15 
kernel. To ensure a cold cache, we unmounted all in- 
volved file systems between each test. We ran all tests at 
least five times and computed 95% confidence intervals 
for the mean elapsed, system, user, and wait times using 
the Student-t distribution. In each case, the half-widths 
of the intervals were less than 5% of the mean. Wait time 
is the difference between elapsed time and CPU time, 
and is affected by I/O and process scheduling. 
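
For reference, the half-width of a 95% confidence interval for a mean under the Student-t distribution is t(0.975, n-1) * s / sqrt(n), where s is the sample standard deviation. The sketch below computes it for a small set of runs. The elapsed times are made-up numbers used only to exercise the formula, and the table holds standard two-sided 95% critical values for a few degrees of freedom.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, indexed by degrees of freedom.
T_975 = {4: 2.776, 5: 2.571, 9: 2.262}


def ci95_half_width(samples):
    """Half-width of the 95% confidence interval for the mean."""
    n = len(samples)
    s = statistics.stdev(samples)        # sample standard deviation (n - 1)
    return T_975[n - 1] * s / math.sqrt(n)


elapsed = [100.0, 102.0, 98.0, 101.0, 99.0]   # hypothetical times (seconds)
mean = statistics.mean(elapsed)
half = ci95_half_width(elapsed)
print(f"{mean:.1f} +/- {half:.2f} s ({100 * half / mean:.1f}% of the mean)")
```

With five runs there are four degrees of freedom, so the critical value is 2.776, noticeably wider than the 1.96 of the normal approximation.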

Unless otherwise mentioned, the system time overheads
were mainly caused by the hash table lookups
on the T-TABLE during the read and write operations and
also by P-TABLE lookups during CREATE_PTR and
DELETE_PTR operations. This CPU overhead is due to
the fact that our prototype is implemented as a
pseudo-device driver that runs on the same CPU as the file
system. In a real SVSDS setting, the hash table lookups
will be performed by the processor embedded in the disk and
hence will not influence the overheads on the host system,
but will add to the wait time.

We have compared the overheads of SVSDS using 
Ext2TSD against Ext2 on a regular disk. We denote 
Ext2TSD on a SVSDS using the name Ext2Ver. The let- 
ters md and all are used to denote selective versioning 
of meta-data and all data respectively. 


6.2 Postmark 


Postmark [8] simulates the operation of electronic mail 
and news servers. It does so by performing a series of 
file system operations such as appends, file reads, direc- 
tory lookups, creations, and deletions. This benchmark 
uses little CPU but is I/O intensive. We configured
Postmark to create 3,000 files, between 100 and 200
kilobytes in size, and perform 300,000 transactions.

Figure 6 shows the performance of Ext2TSD on SVSDS
for Postmark with a versioning interval of 30 seconds. 
Postmark deletes all its files at the end of the benchmark, 
so no space is occupied at the end of the test. SVSDS 
transparently creates versions and thus, consumes stor- 
age space which is not visible to the file system. The av- 
erage number of versions created during this benchmark 
is 27. 

For Ext2TSD, system time is observed to be 1.1 times
more, and wait time is 8% less than that of Ext2. The
increase in the system time is because of the hash table
lookups during CREATE_PTR and DELETE_PTR calls. The
decrease in the wait time is because Ext2TSD does not take
into account future growth of files while allocating space
for files. This decrease in wait time allowed Ext2TSD to
perform slightly better than the Ext2 file system on a
regular disk, but would have had a more significant impact
in a benchmark with files that grow.

Figure 6: Postmark results for SVSDS

For Ext2Ver(md), elapsed time is observed to have no
overhead, system time is 4 times more, and wait time is
20% less than that of Ext2. The increase in system time
is due to the additional hash table lookups to locate
entries in the T-TABLE. The decrease in wait time is due to
better spatial locality and an increased number of requests
being merged inside the disk. This is because the random
writes (i.e., writing the inode block along with writing
the newly allocated block) were converted to sequential
writes due to copy-on-write in versioning.

For Ext2Ver(all), the system time is 4 times more and
wait time is 20% less than that of Ext2. The wait time in
Ext2Ver(all) does not have any observable overhead over
the wait time in Ext2Ver(md). Hence, it is not possible
to explain the slight increase in the wait time.


6.3 OpenSSH Compile 


To show the space overheads of a typical program
installer, we compiled the OpenSSH source code. We used
OpenSSH version 4.5, and analyzed the overheads of
Ext2 on a regular disk, Ext2TSD on a TSD, and meta-data
and all data versioning in Ext2TSD on SVSDS
for the untar, configure, and make stages combined.
Since the entire benchmark completed in 60-65 seconds,
we used a 2-second versioning interval to create more
versions of blocks. On average, 10 versions were
created. This is because the pdflush daemon starts writing
the modified file system blocks to disk after 30 seconds.
As a result, the disk does not get any write requests
for blocks during the first 30 seconds of the OpenSSH
Compile benchmark. The amount of data generated by
this benchmark was 16MB. The results for the OpenSSH
compilation are shown in Figure 7.

Figure 7: OpenSSH Compile results for SVSDS


For Ext2TSD, we recorded an insignificant increase in
elapsed time and system time, and a 108% increase in the
wait time over Ext2. Since the elapsed and system times
are similar, it is not possible to account for the increase
in wait time.


For Ext2Ver(md), we recorded a 7% increase in 
elapsed time, and a 41% increase in system time over 
Ext2. The increase in system time overhead is due to the 
additional hash table lookups by SVL to remap the read 
and write requests. Ext2Ver(md) consumed 496KB of 
additional disk space to store the versions. 


For Ext2Ver(all), we recorded a 7% increase in
elapsed time, and a 39% increase in system time over
Ext2. Ext2Ver(all) consumes 15MB of additional space
to store the versions. The overhead of storing versions
is 95%. From this benchmark, we can clearly see that
versioning all data inside the disk is not very useful,
especially for program installers.


6.4 Kernel Compile


To simulate a CPU-intensive user workload, we compiled
the Linux kernel source code. We used a vanilla
Linux 2.6.15 kernel and analyzed the overheads of
Ext2TSD on a TSD and Ext2TSD on SVSDS with versioning
of all blocks and selective versioning of meta-data
blocks against regular Ext2, for the untar, make
oldconfig, and make operations combined. We used a
30-second versioning interval, and 78 versions were
created during this benchmark. The results are shown in
Figure 8.

Figure 8: Kernel Compile results for SVSDS.


For Ext2TSD, elapsed time is observed to be the same,
system time is 4% lower, and wait time is 24% lower
than that of Ext2. The decrease in the wait time
is because Ext2TSD does not consider future growth of
files while allocating new blocks.


For Ext2Ver(md), elapsed time is observed to be the
same, system time overhead is 5%, and wait time is 6%
lower than that of Ext2. The increase in wait time in
relation to Ext2TSD is due to versioning meta-data blocks,
which affects the locality of the stored files. The space
overhead of versioning meta-data blocks is 51 MB.


For Ext2Ver(all), elapsed time is observed to be
indistinguishable from that of Ext2, and system time is
10% higher than that of Ext2. The increase in system time
is due to the additional hash table lookups required for
storing the mapping information in the V-TABLE. The space
overhead of versioning all blocks is 181 MB.


7 Related Work


SVSDS borrows ideas from much previous work.
The idea of versioning at the granularity of files has been
explored in many file systems [6, 10, 12, 15, 19]. These
file systems maintain previous versions of files primarily
to help users recover from their mistakes. The main
advantage of SVSDS over these systems is that it is
decoupled from the client operating system. This helps in
protecting the versioned data, even in the event of an
intrusion or an operating system compromise. The
virtualization of disk address space has been implemented in
several systems [3, 7, 9, 13, 21]. For example, the Logical
disk [3] separated the file-system implementation
from the disk characteristics by providing a logical view
of the block device. The Storage Virtualization Layer
in SVSDS is analogous to their logical disk layer. The
operation-based constraints in SVSDS are a scaled-down
version of access control mechanisms. We now compare
and contrast SVSDS with other disk-level data protection
systems: S4 [20], TRAP [23], and Peabody [7].

The Self-Securing Storage System (S4) is an
object-based disk that internally audits all requests that
arrive at the disk. It protects data in compromised systems
by combining log-structuring with journal-based meta-data
versioning to prevent intruders from tampering with or
permanently deleting the data stored on the disk. SVSDS,
on the other hand, is a block-based disk that protects data
by transparently versioning blocks inside the disk. The
guarantees provided by S4 hold true only during the window
of time in which it versions the data. When the disk
runs out of storage space, S4 stops versioning data until
the cleaner thread can free up space for versioning
to continue. As S4 is designed to aid in intrusion
diagnosis and recovery, it does not provide any flexibility
to users to version files (i.e., objects) inside the disk. In
contrast, SVSDS allows users to select files and
directories for versioning inside the disk. The disadvantage
of S4 is that it does not provide any protection
mechanism to prevent modifications to stored data during
intrusions and always depends on the versioned data to
recover from intrusions. In contrast, SVSDS attempts to
prevent modifications to stored data during intrusions by
enforcing operation-based constraints on system and log
files.

Timely Recovery to any Point-in-time (TRAP) is a
disk array architecture that provides data recovery in
three different modes: TRAP-1, which takes snapshots at
periodic time intervals; TRAP-3, which provides timely
recovery to any point in time at the block device level
(this mode is popularly known as Continuous Data
Protection in storage); and TRAP-4, which is similar to
RAID-5, where a log of the parities is kept for
each block write. The disadvantage of this system is
that it cannot provide TRAP-2 (data protection at the
file level) because its block-based disk lacks semantic
information about the data stored in the disk blocks. Hence,
TRAP ends up versioning all the blocks. TRAP-1 is
similar to our current implementation, where an
administrator can choose a particular interval to version
blocks. We have implemented TRAP-2, or file-level versioning
inside the disk, as SVSDS has semantic information about
blocks stored on the disk through pointers. TRAP-3 is
similar to the mode in SVSDS where the time between
creating versions is set to zero. Since SVSDS runs on
a local disk, it cannot implement the TRAP-4 level of
versioning.
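
To make the relationship between interval-based versioning and a zero-length interval concrete, the following toy model may help. It is only a sketch under our own assumptions: the class name, its fields, and the in-memory append-only store are hypothetical and do not describe the actual SVSDS tables. Within one versioning interval a block is overwritten in place; the first write to a block in a new interval is redirected to a fresh physical block (copy-on-write), so the old contents survive as a version.

```python
class VersionedDisk:
    """Toy block store that keeps one version per block per interval."""

    def __init__(self, interval_s):
        self.interval_s = interval_s
        self.physical = []          # append-only array of physical blocks
        self.mapping = {}           # logical block -> current physical index
        self.history = {}           # logical block -> older physical indexes
        self.epoch_start = 0.0
        self.written_in_epoch = set()

    def write(self, now, lbn, data):
        # Start a new versioning interval once the current one has elapsed.
        if now - self.epoch_start >= self.interval_s:
            self.epoch_start = now
            self.written_in_epoch.clear()
        if lbn in self.mapping and lbn in self.written_in_epoch:
            # Already versioned in this interval: overwrite in place.
            self.physical[self.mapping[lbn]] = data
            return
        if lbn in self.mapping:
            # First write in this interval: keep the old block as a version.
            self.history.setdefault(lbn, []).append(self.mapping[lbn])
        self.physical.append(data)
        self.mapping[lbn] = len(self.physical) - 1
        self.written_in_epoch.add(lbn)

    def read(self, lbn):
        return self.physical[self.mapping[lbn]]

    def versions(self, lbn):
        return [self.physical[p] for p in self.history.get(lbn, [])]


periodic = VersionedDisk(interval_s=30)
for t, data in [(0, "a"), (10, "b"), (35, "c")]:
    periodic.write(t, 0, data)
print(periodic.read(0), periodic.versions(0))

continuous = VersionedDisk(interval_s=0)
for t, data in [(0, "a"), (0, "b"), (1, "c")]:
    continuous.write(t, 0, data)
print(continuous.read(0), continuous.versions(0))
```

With a 30-second interval the write at t=10 is absorbed into the current version, whereas with a zero interval every overwrite leaves a recoverable version behind, which is the behavior the text likens to continuous data protection.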


Peabody is a network block storage device that
virtualizes the disk space to provide the illusion of a
single large disk to the clients. It maintains a centralized
repository of sectors and tries to reduce the space
utilization by coalescing blocks across multiple virtual
disks that contain the same data. This is done to improve
the cache utilization and to reduce the total amount of
storage space. Peabody versions data by maintaining write
logs and transaction logs. The write logs store the
previous contents of blocks before they are overwritten, and
the transaction logs contain information about when the
block was written, the location of the block, and the
content hashes of the blocks. The disadvantage of this
approach is that it cannot selectively version blocks inside
the disk.


8 Conclusions 


Data protection against attackers with OS root privileges 
is fundamentally a hard problem. While there are nu- 
merous security mechanisms that can protect data under 
various threat scenarios, only very few of them can be ef- 
fective when the OS is compromised. In view of the fact 
that it is virtually impossible to eliminate all vulnerabil- 
ities in the OS, it is useful to explore how best we can 
recover from damages once a vulnerability exploit has 
been detected. In this paper, we have taken this direc- 
tion and explored how a disk-level recovery mechanism 
can be implemented, while still allowing flexible policies 
in tune with the higher-level abstractions of data. We 
have also shown how the disk system can enforce simple 
constraints that can effectively protect key executables 
and log files. Our solution that combines the advantages 
of a software and a hardware-level mechanism proves to 
be an effective choice against alternative methods. Our 
evaluation of our prototype implementation of SVSDS
shows that performance overheads are negligible for
normal user workloads.

Future Work. Our current design supports reverting
the entire disk state to an older version. In the future, we
plan to work on supporting more fine-grained recovery
policies to revert specific files or directories to their older
versions. SVSDS, in its current form, relies on the
administrator to detect an intrusion and revert to a
previously known safe state. We plan to build a storage-based
intrusion detection system [14] inside SVSDS. Our
system would do better than the system developed by
Pennington et al. [14] as we also have data dependencies
conveyed through pointers. We also plan to explore more
operation-based constraints that can be supported at the
disk level.


9 Acknowledgments 


We would like to thank the anonymous reviewers for their
helpful comments. We thank Sean Callanan and Avishay
Traeger for their feedback about the project. We would
also like to thank the following people for their
comments and suggestions on the work: Radu Sion, Rob
Johnson, Radu Grosu, Alexander Mohr, and the members
of our research group (File systems and Storage Lab
at Stony Brook).

This work was partially made possible by NSF CAREER
EIA-0133589 and NSF CCR-0310493 awards.


References 


[1] B. Berliner and J. Polk. Concurrent Versions System
(CVS). www.cvshome.org, 2001.

[2] CollabNet, Inc. Subversion. http://subversion.tigris.org,
2004.

[3] W. de Jonge, M. F. Kaashoek, and W. C. Hsieh.
The logical disk: A new approach to improving file
systems. In Proceedings of the 19th ACM Symposium
on Operating Systems Principles (SOSP '03),
Bolton Landing, NY, October 2003. ACM SIGOPS.

[4] T. E. Denehy, A. C. Arpaci-Dusseau, and R. H.
Arpaci-Dusseau. Bridging the information gap in
storage protocol stacks. In Proceedings of the Annual
USENIX Technical Conference, pages 177-190,
Monterey, CA, June 2002. USENIX Association.

[5] G. R. Ganger. Blurring the Line Between OSes and
Storage Devices. Technical Report CMU-CS-01-166,
CMU, December 2001.

[6] D. K. Gifford, R. M. Needham, and M. D.
Schroeder. The Cedar File System. Communications
of the ACM, 31(3):288-298, 1988.

[7] C. B. Morrey II and D. Grunwald. Peabody: The
time travelling disk. In Proceedings of the 20th
IEEE/11th NASA Goddard Conference on Mass
Storage Systems and Technologies (MSS'03), pages
241-253. IEEE Computer Society, 2003.

[8] J. Katcher. PostMark: A new filesystem benchmark.
Technical Report TR3022, Network Appliance, 1997.
www.netapp.com/tech_library/3022.html.

[9] E. K. Lee and C. A. Thekkath. Petal: Distributed
virtual disks. In Proceedings of the Seventh International
Conference on Architectural Support for
Programming Languages and Operating Systems
(ASPLOS-7), pages 84-92, Cambridge, MA, 1996.

[10] K. McCoy. VMS File System Internals. Digital
Press, 1990.

[11] M. Mesnier, G. R. Ganger, and E. Riedel. Object
based storage. IEEE Communications Magazine,
41, August 2003. ieeexplore.ieee.org.

[12] K. Muniswamy-Reddy, C. P. Wright, A. Himmer,
and E. Zadok. A Versatile and User-Oriented Versioning
File System. In Proceedings of the Third
USENIX Conference on File and Storage Technologies
(FAST 2004), pages 115-128, San Francisco,
CA, March/April 2004. USENIX Association.

[13] D. Patterson, G. Gibson, and R. Katz. A case for
redundant arrays of inexpensive disks (RAID). In
Proceedings of the ACM SIGMOD, pages 109-116,
June 1988.

[14] A. Pennington, J. Strunk, J. Griffin, C. Soules,
G. Goodson, and G. Ganger. Storage-based intrusion
detection: Watching storage activity for suspicious
behavior. In Proceedings of the 12th USENIX
Security Symposium, pages 137-152, Washington,
DC, August 2003.

[15] D. J. Santry, M. J. Feeley, N. C. Hutchinson, and
A. C. Veitch. Elephant: The file system that never
forgets. In Proceedings of the IEEE Workshop on
Hot Topics in Operating Systems (HOTOS), pages
2-7, Rio Rico, AZ, March 1999.

[16] G. Sivathanu, S. Sundararaman, and E. Zadok.
Type-safe disks. In Proceedings of the 7th Symposium
on Operating Systems Design and Implementation
(OSDI 2006), pages 15-28, Seattle, WA,
November 2006. ACM SIGOPS.

[17] M. Sivathanu, L. N. Bairavasundaram, A. C.
Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Life or
death at block-level. In Proceedings of the 6th Symposium
on Operating Systems Design and Implementation
(OSDI 2004), pages 379-394, San Francisco,
CA, December 2004. ACM SIGOPS.

[18] M. Sivathanu, V. Prabhakaran, A. C. Arpaci-Dusseau,
and R. H. Arpaci-Dusseau. Improving
storage system availability with D-GRAID. In
Proceedings of the Third USENIX Conference on
File and Storage Technologies (FAST 2004), pages
15-30, San Francisco, CA, March/April 2004.
USENIX Association.

[19] Craig A. N. Soules, Garth R. Goodson, John D.
Strunk, and Gregory R. Ganger. Metadata efficiency
in versioning file systems. In Proceedings of
the Second USENIX Conference on File and Storage
Technologies (FAST '03), pages 43-58, San
Francisco, CA, March 2003. USENIX Association.

[20] J. D. Strunk, G. R. Goodson, M. L. Scheinholtz,
C. A. N. Soules, and G. R. Ganger. Self-securing
storage: Protecting data in compromised
systems. In Proceedings of the 4th Usenix Symposium
on Operating System Design and Implementation
(OSDI '00), pages 165-180, San Diego, CA,
October 2000. USENIX Association.

[21] D. Teigland and H. Mauelshagen. Volume managers
in linux. In Proceedings of the Annual
USENIX Technical Conference, FREENIX Track,
pages 185-197, Boston, MA, June 2001. USENIX
Association.

[22] Walter F. Tichy. RCS - a system for version
control. Software: Practice and Experience,
15(7):637-654, 1985.

[23] Q. Yang, W. Xiao, and J. Ren. TRAP-array: A
disk array architecture providing timely recovery
to any point-in-time. In Proceedings of the 33rd
Annual International Symposium on Computer Architecture
(ISCA '06), pages 289-301. IEEE Computer
Society, 2006.


Privacy-Preserving Location Tracking of Lost or Stolen Devices: 
Cryptographic Techniques and Replacing Trusted Third Parties with DHTs 


Thomas Ristenpart* Gabriel Maganis* 


*University of California, San Diego 
tristenp@cs.ucsd.edu 


Abstract 


We tackle the problem of building privacy-preserving
device-tracking systems, or private methods to assist in
the recovery of lost or stolen Internet-connected mobile
devices. The main goals of such systems are seemingly
contradictory: to hide the device’s legitimately-visited
locations from third-party services and other parties
(location privacy) while simultaneously using those same
services to help recover the device’s location(s) after it
goes missing (device-tracking). We propose a system,
named Adeona, that nevertheless meets both goals. It 
provides strong guarantees of location privacy while pre- 
serving the ability to efficiently track missing devices. 
We build a version of Adeona that uses OpenDHT as the 
third party service, resulting in an immediately deploy- 
able system that does not rely on any single trusted third 
party. We describe numerous extensions for the basic de- 
sign that increase Adeona’s suitability for particular de- 
ployment environments. 


1 Introduction 


The growing ubiquity of mobile computing devices, and
our reliance upon them, means that losing them is
simultaneously more likely and more damaging. For example,
the annual CSI/FBI Computer Crime and Security Survey
ranks laptop and mobile device theft as a prevalent
and expensive problem for corporations [16]. To help
combat this growing problem, corporations and individuals
are deploying commercial device-tracking software
(like “LoJack for Laptops” [1]) on their mobile devices.
These systems typically send the identity of the
device and its current network location (e.g., its IP
address) over the Internet to a central server run by the
device-tracking service. After a device is lost, the
service can determine the location of the device and,
subsequently, can work with the owner and legal authorities to
recover the device itself. The number of companies
offering such services, e.g., [1, 9, 21, 29, 34, 37, 38], attests
to the large and growing market for device tracking.


Unfortunately, these systems are incompatible with 
the oft-cited goal of location privacy [17, 22, 23] since 
the device-tracking services can always monitor the lo- 
cation of an Internet-enabled device — even while the 
device is in its owner’s possession. This presents a signif- 
icant barrier to the psychological acceptability of track- 
ing services. To paraphrase one industry representative: 
companies will deploy these systems in order to track 
their devices, but they won’t like it. The current situation 
leaves users of mobile devices in the awkward position of 
either using tracking services or protecting their location 
privacy. 

We offer an alternative: privacy-preserving device- 
tracking systems. Such a system should provide strong 
guarantees of location privacy for the device owner’s le- 
gitimately visited locations while nevertheless enabling 
tracking of the device after it goes missing. It should do 
so even while relying on untrusted third party services to 
store tracking updates. 


The utility of device tracking systems. Before div- 
ing into technical details, we first step back to reevalu- 
ate whether device tracking, let alone privacy-preserving 
device tracking, even makes sense as a legitimate secu- 
rity tool for mobile device users. A motivated and suf- 
ficiently equipped or knowledgeable thief (i.e., the mali- 
cious entity assumed in possession of a missing device) 
can always prevent Internet device tracking: he or she 
can erase software on the device, deny Internet access, 
or even destroy the device. One might even be tempted 
to conclude that the products of [1, 9, 21, 29, 34, 37, 38] 
are just security “snake oil”. 

We contend that this extreme view of security is
inappropriate for device tracking. While device tracking
will not always work, these systems can work, and vendors
(who may be admittedly biased) claim high recovery
rates [1]. The common-case thief is, after all, often
opportunistic and unsophisticated, and it is against such
thieves that tracking systems can clearly add significant
value. Our work aims to retain this value while
simultaneously addressing the considerable threats to user
location privacy.


System goals. A device tracking system consists of: 
client hardware or software logic installed on the device; 
(sometimes) cryptographic key material stored on the de- 
vice; (sometimes) cryptographic key material maintained 
separately by the device owner; and a remote storage fa- 
cility. The client sends location updates over the Inter- 
net to the remote storage. Once a device goes missing, 
the owner or authorized agent searches the remote stor- 
age for location updates pertaining to the device’s current 
whereabouts. 

To understand the goals of a privacy-preserving track- 
ing system, we begin with an exploration of existing or 
hypothetical tracking systems in scenarios that are de- 
rived from real situations (Section 2). This reveals a re- 
strictive set of deployment constraints (e.g., supporting 
both efficient hardware and software clients) and an intri- 
cate threat model for location privacy where the remote 
storage provider is untrusted, the thief may try to learn 
past locations of the device, and other outsiders might 
attempt to glean private data from the system or “piggy- 
back” on it to easily track a device. We extract the fol- 
lowing main system goals. 

(1) Updates sent by the client must be anonymous and 
unlinkable. This means that no adversary should 
be able to either associate an update to a particular 
device, or even associate two updates to the same 
(unknown) device. 

(2) The tracking client must ensure forward-privacy, 
meaning a thief, even after seeing all of the inter- 
nal state of the client, cannot learn past locations of 
the device. 


(3) The client should protect against timing attacks by 
ensuring that the periodicity of updates cannot be 
easily used to identify a device. 


(4) The owner should be able to efficiently search the 
remote storage in a privacy-preserving manner. 


(5) The system must match closely the efficiency, de- 
ployability, and functionality of existing solutions 
that have little or no privacy guarantees. 

These goals are not satisfied by straightforward or
existing solutions. For example, simply encrypting location
updates before sending them to the remote storage does not
allow for efficient retrieval. As another example,
mechanisms for generating secure audit logs [32], while
seemingly applicable, in fact violate our anonymity and
unlinkability requirements by design.

We emphasize that one non-goal of our system is
improved device tracking. As discussed above, all tracking
systems in this category have fundamental limitations.
Indeed, our overarching goal is to show that, in any
setting where deploying a device tracking system makes
sense, one can do so effectively without compromising
privacy.


Adeona. Our system, named Adeona after the Roman 
goddess of “safe returns,” meets the aggressive goals
outlined above. The client consists of two modules: a 
location-finding module and a cryptographic core. With 
a small amount of state, the core utilizes a forward-secure 
pseudorandom generator (FSPRG) to efficiently and 
deterministically encapsulate updates, rendering them 
anonymous and unlinkable, while also scheduling them 
to be sent to the remote storage at pseudorandomly deter- 
mined times (to help mitigate timing attacks). The core 
ensures forward-privacy: a thief, after determining all of 
the internal state of the client and even with access to all 
data on the remote storage, cannot use Adeona to reveal 
past locations of the device. The owner, with a copy of 
the initial state of the client, can efficiently search the 
remote storage for the updates. The cryptographic core 
uses only a sparing number of calls to AES per update. 
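
The way a forward-secure generator can yield unlinkable storage locations, pseudorandom send times, and forward-privacy may be easier to see in a small sketch. The code below is illustrative only and rests on our own assumptions: it substitutes HMAC-SHA256 from the Python standard library for the AES-based construction mentioned above, the labels and function names are hypothetical, and encryption of the update payload is omitted. Each state deterministically yields a storage index, a key, and a delay, and is then replaced by a one-way function of itself.

```python
import hashlib
import hmac


def _prf(key: bytes, label: bytes) -> bytes:
    """Keyed pseudorandom function (HMAC-SHA256 stands in for AES here)."""
    return hmac.new(key, label, hashlib.sha256).digest()


def next_state(state: bytes) -> bytes:
    """One-way ratchet: the previous state cannot be recomputed from the new one."""
    return _prf(state, b"next")


def derive_update_params(state: bytes, max_delay_s: int = 3600):
    """Derive the storage index, encryption key, and send delay for one update."""
    index = _prf(state, b"index").hex()      # where the update is stored
    enc_key = _prf(state, b"key")            # key that would encrypt the update
    # Simple modular reduction; its slight bias is ignored in this sketch.
    delay = int.from_bytes(_prf(state, b"delay")[:4], "big") % max_delay_s
    return index, enc_key, delay


def indices_from(initial_state: bytes, count: int):
    """Walk the chain from a state, returning the indices and the final state."""
    state, indices = initial_state, []
    for _ in range(count):
        index, _key, _delay = derive_update_params(state)
        indices.append(index)
        state = next_state(state)            # the client discards the old state
    return indices, state


seed = bytes(32)                             # hypothetical initial state
client_indices, client_state = indices_from(seed, 5)
owner_indices, _ = indices_from(seed, 5)     # owner replays from its own copy
print(client_indices == owner_indices)
```

The owner, holding the initial state, regenerates the same indices and can fetch exactly those entries from the remote storage, while a thief who extracts only the client's current state would have to invert the ratchet to learn earlier indices or keys.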


The cryptographic techniques in the Adeona core have 
wide applicability, straightforwardly composing with 
any location-finding technique or remote storage instan- 
tiation. We showcase this by implementing Adeona as 
a fully functional tracking system using a public dis- 
tributed storage infrastructure, OpenDHT [30]. We could 
also have potentially used other distributed hash table in- 
frastructures such as the Azureus BitTorrent DHT. Using 
a DHT for remote storage means that there is no sin- 
gle trusted infrastructural component and that deploy- 
ment can proceed immediately in a community-based 
way. End users need simply install a software client to 
enable private tracking service. Our system provides the 
first device tracking system not tied to a particular ser- 
vice provider. Moreover, to the best of our knowledge, 
we are also the first to explore replacing a centralized 
trusted third-party service with a decentralized DHT. 


Extensions. Adeona does make slight trade-offs be- 
tween simplicity, privacy, and device tracking. We ad- 
dress these trade-offs with several extensions to the ba- 
sic Adeona system. These extensions serve two pur- 
poses: they highlight the versatility of our basic privacy- 
enhancing techniques and they can be used to better pro- 
tect the tracking client against technically sophisticated 
thieves (at the cost of slight increases in complexity). 
In particular, we discuss several additions to the basic 
functionality of Adeona. For example, we design a novel
cryptographic primitive, a tamper-evident FSPRG, to
allow detection of adversarial modifications to the client’s
state.


Implementation and field testing. We have implemented
the Adeona system and some of its extensions
as user applications for Linux and Mac OS X. Moreover,
we conducted a short trial in which the system was
deployed on real users’ systems, including a number of
laptops. Our experience suggests that the Adeona system
provides an immediate solution for privacy-preserving
device tracking. The code is currently being readied for
an open-source public release to be available at
http://adeona.cs.washington.edu/, and we encourage the
further use of this system for research purposes.


Outline. In the next section we provide a detailed dis- 
cussion of tracking scenarios that help motivate our (in- 
volved) design constraints and threat models. Readers 
eager for technical details might skip ahead to Section 3, 
which describes the Adeona core. The full system based 
on OpenDHT is given in Section 4. We provide a se- 
curity analysis in Section 5. Our implementations, their 
evaluation, and the results of the field trial appear in 
Section 6. We discuss Adeona’s suitability for further de- 
ployment settings in Section 7 and extensions to Adeona 
are detailed in Section 8. We conclude in Section 9. 


2 Problem Formulation 


To explore existing and potential tracking system de- 
signs and understand the variety of adversarial threats, 
we first study a sequence of hypothetical tracking sce- 
narios. While fictional, the scenarios are based on real 
stories and products. These scenarios uncover issues that 
will affect our goals and designs for private device track- 
ing. 


Scenario 1. Vance, an avid consumer of mobile devices,
recently heard about the idea of “LoJack for Laptops.”
He searches the Internet, finds the EmailMe device tracking
system, and installs it on his laptop. The EmailMe
tracking client software sends an email (like the example
shown in Figure 1) to his webmail account every time
the laptop connects to the Internet. Months later, Vance
is distracted while working at his favorite coffee shop,
and a thief takes his laptop. Now Vance’s foresight
appears to pay off: he uses a friend’s computer to access
the tracking emails sent by his missing laptop. Working
with the authorities, he is able to determine that
the laptop last connected to the Internet from a public
wireless access point in his home city. Unfortunately the
physical location was hard to pinpoint from just the IP
addresses. A month after the theft Vance stops receiving
tracking emails. An investigation eventually reveals that
the thief sold the laptop at a flea market to an
unsuspecting customer. That customer later resold the laptop at a
pawn shop. The pawnbroker, before further reselling the
laptop, must have refurbished the laptop by wiping its
hard drive and installing a fresh version of the operating
system.


Discussion: The theft of Vance’s laptop highlights a few
issues regarding limitations on the functionality of
device tracking systems. First, a client without hardware
support can provide network location data only when
faced by such a flea-market attack: these occur when a
technically unsophisticated thief steals a device to use it
or sell it (with its software intact) as quickly as possible.
Second, network location information will not always be 
sufficient for precisely determining the physical location 
of a device. Third, all clients (even those with hardware 
support) can be disabled from sending location updates 
(simply by disallowing all Internet access or by filtering 
out just the location updates if they can be isolated). 

The principal goal of this paper is not to achieve bet- 
ter Internet tracking functionality than can be offered by 
existing solutions. Instead, we address privacy concerns 
while maintaining device tracking functionality equiva- 
lent to solutions with no or limited privacy guarantees. 
The next scenarios highlight the types of privacy con- 
cerns inherent to tracking systems. 


Scenario 2. A few weeks before the theft of Vance’s 
laptop, Vance was the target of a different kind of at- 
tack. His favorite coffee shop had been targeted by crack- 
ers because the shop is in a rich neighborhood and their 
routers are not configured to use WPA [19]. The crackers 
recorded all the coffee shop’s traffic, including Vance’s 
location-update emails, which were not encrypted. (The 
webmail service did not use TLS, nor does the EmailMe 
client encrypt the outgoing emails.) The crackers sell the 
data garnered from Vance’s tracking emails to identity 
thieves, who then use Vance’s identity to obtain several 
credit cards. 


Discussion: The content of location updates should al- 
ways be sent via encrypted channels, lest they reveal 
private information to passive eavesdroppers. This is of 
particular importance for mobile computing devices, be- 
cause of their almost universal use of wireless communi- 
cation, which may or may not use encryption. 


Scenario 3. Vance works as a salesman for a small 
distributor of coffee-related products, called Very Good 
Coffee (VGC). He recently went on a trip abroad for 
VGC to investigate purchasing a supplier of coffee beans. 
On his return trip, he was stopped at customs and 
his laptop was temporarily confiscated for an “inspec- 
tion” [28, 33]. Vance, with his ever-present foresight, had 
predicted this would happen: he encrypted all his sensi- 
tive work-related files and removed any information that 
might leak what he had been doing while in country. The 
laptop was shortly returned with files apparently unmod- 
ified. 

Unknown to Vance, the EmailMe client had cached 
all the recently visited network locations on the laptop. 
Included were several IP addresses used by the supplier 
that VGC intended to purchase. The customs agents sold 
this information to a local competitor of VGC. Using this 
tip, the local competitor successfully blocked VGC’s bid 
to purchase the supplier. 


From: tech@brigadoonsoftware.com 
To: tech@brigadoonsoftware.com 
BCC: tomrist@gmail.com 
Subject: Information 

Date: 16-08-2007 
Time: 11:14:05 
Computer Name : TOM-8F760D01401 
User Name : LOCAL SERVICE 
IPAddress :0.0.0.0 
IPAddress :128.208.7.80 

PCPH Pro For Win 95/98/ME/NT/2K/XP - Version 3.0 (Eval) 

Mac Address: 00-18-8B-A2-05-E5 
Mac Address: 00-18-DE-9B-F0-5A 
Serial Number: DC44BF26 
Registrants Name: Tom 
Organization: Tom 
Address: 513 Brooklyn Avenue 
City: Seattle 
State/Province: WA 
Zip/Postal Code: 98105 
Country: USA 
Work Phone: 2066163997 

Figure 1: Example tracking email sent (unencrypted) by PC Phone Home [9] from one of the authors’ laptops. 


Discussion: This scenario addresses the need for for- 
ward privacy. A tracking client should not cache previ- 
ous locations, lest a thief (or even, as the scenario depicts, 
some other untrusted party with temporary access to the 
device) easily break the owner’s past location privacy. 


Scenario 4. Hearing about Vance’s recent troubles with 
property and identity theft, the VGC management chose 
to contract with (the optimistically named) All Devices 
Recovered (AllDevRec) to provide robust tracking ser- 
vices for VGC’s mobile assets. AllDevRec, having made 
deals with laptop manufacturers, ensures that VGC’s 
new laptops have hardware-supported tracking clients in- 
stalled. The clients send updates using a proprietary 
protocol over an encrypted channel to AllDevRec’s 
servers each time an Internet connection is made.* 

Ian, a recovery-management technician employed by 
AllDevRec, has a good friend Eve who happens to work 
at a business that competes with VGC. Ian brags to Eve 
that his position in AllDevRec allows him to access the 
locations from which VGC’s employees access the Inter- 
net. This gives Eve an idea, and so she goads Ian into 
giving her information on the network locations visited 
by VGC sales people. From this Eve can infer the coffee 
shops VGC is targeting as potential customers, allowing 
her company to precisely undercut VGC’s offerings. 


Discussion: Using encrypted channels is insufficient to 
guarantee data privacy once the location updates reach 
a service provider’s storage systems. The location up- 
dates should remain encrypted while stored. This mit- 
igates the level of trust device owners must place in a 
service provider’s ability to enforce proper data manage- 
ment policies (to protect against insider attacks) and se- 
curity mechanisms (to protect against outsiders gaining 
access). 


Scenario 5. Vance, now jobless due to VGC’s recent 
bankruptcy, has been staying at Valerie’s place. Va- 
lerie works at a large company, with its own in-house IT 
staff. The management decided to deploy a comprehen- 
sive tracking system for mobile computing asset man- 
agement. To ensure employee acceptability of a tracking 
system, the management had the IT staff implement a 
system with privacy and security issues in mind: each 
device is assigned a random identification number and 
a public key, secret key pair for a public-key encryption 
scheme. The database mapping a device to its identifi- 
cation number, public key, and secret key is stored on 
a system with several procedural safeguards in place to 
ensure no unwarranted accesses. With each new Internet 
connection, the tracking client sends an update encrypted 
under the public key and indexed under the random iden- 
tification number. 


When Valerie goes to lunch (which varies in time quite 
a bit depending on her work), she heads across the street 
to a cafe to get away from the office. She often uses 
her company laptop and the cafe’s wireless to peruse the 
Internet. Since deployment of the new tracking system, 
Valerie has been complaining that no matter when she 
takes lunch, Irving (a member of the IT staff who is re- 
puted to have an unreciprocated romantic interest in her) 
almost always ends up coming by the cafe a few minutes 
after she arrives.* 


Because the location updates sent by Valerie’s laptop 
use a static identifier, it was easy for Irving (even without 
access to the protected database) to infer which was hers: 
he looked at identifiers with updates originating from the 
block of IP addresses used within Valerie’s department 
and those used by the cafe. After a few guesses (which he 
validated by simply seeing if she was at the cafe), Irving 
determined her device’s identification number and from 
then on knew whenever she went for lunch. 


Discussion: The use of unchanging identifiers (even if 
originally anonymized) allows linking attacks, in which 
an adversary observing updates can associate updates 
from different locations as being from the same device. 
Additionally, the finely-grained timing information revealed 
by sending updates upon each new Internet connection 
is a side-channel that can leak information. 


Summary. The sequence of scenarios depicts the wide 
variety of potential users of tracking systems. Moreover, 
they highlight two fundamental security goals. 


• Vance was a victim of compromised device tracking. 
(Scenario 1.) 


• Vance, VGC, and Valerie were all victims of compromised 
location privacy. (Scenarios 2, 3, 4, and 5.) 


The threat models related to achieving location privacy 
while retaining device tracking capabilities are complex 
because there exist numerous adversaries with widely 
varied powers and motivation: 


• The unscrupulous party in possession of a device, 
which we will simply call the thief. The thief might be 
unsophisticated, sophisticated and intent on disabling 
the tracking device, or sophisticated and intent on revealing 
past locations. 


• Internet-connected outsiders that might intercept update 
traffic (e.g., the crackers at the coffee shop). 
Such adversaries call for ensuring the use of encrypted 
channels. 


• The remote storage provider, or the entity controlling 
the system(s) that host location updates, might 
be untrustworthy, suggesting the need for location updates 
that are anonymous, unlinkable, and encrypted, 
thereby denying private information even to the remote 
storage provider. 


3 The Adeona Core: Providing Anony- 
mous, Unlinkable Updates 


The core module is the portion of a client primarily re- 
sponsible for preparing, scheduling, and sending location 
updates to the remote storage. The Adeona core is, con- 
sequently, the foundation of our tracking system’s pri- 
vacy properties. We treat its development first, and men- 
tion that the core stands by itself as a component that will 
work in numerous deployment settings, in addition to the 
setting handled by the full Adeona system (described in 
the next section). 

The discussion in Section 2 illustrates that the Adeona 
core must provide mechanisms to 
(1) ensure content sent to the remote storage is anonymous 
and unlinkable; 


(2) ensure forward-privacy (stored data on the client 
should not be sufficient for revealing previous locations); 


(3) mitigate timing attacks; and 


(4) allow the owner to efficiently search the remote 
storage for updates. 


Basic design. A first approach for building a core would 
be to just utilize a secure symmetric encryption scheme. 
That is, the owner could install on the client a secret key 
and also store a copy separately, perhaps printed on a 
piece of paper or stored on a secure removable token. 
For each new Internet connection, the core would en- 
crypt the location data using this secret key and imme- 
diately send the ciphertext to the remote storage. Goal 
(1) above would be satisfied because (assuming one used 
a standard, secure encryption scheme) these ciphertexts 
would, indeed, be anonymous and unlinkable. But, the 
other three goals are not met. A thief that gets access 
to the device and the secret key could decrypt previous 
updates. Sending the ciphertext immediately upon de- 
tecting a new Internet connection also leaks fine-grained 
timing information. More importantly, since ciphertexts 
submitted by all users are anonymous, there is no effi- 
cient way for the owner to search the database for his 
updates.° 

The Adeona core utilizes a more sophisticated ap- 
proach to tackle the other goals while preserving the abil- 
ity to address goal (1). Instead of a key for an encryp- 
tion scheme, the owner initializes the client with a se- 
cret cryptographic seed for a pseudorandom generator 
(PRG) [6]. Each time the core is run it uses the PRG and 
the seed to deterministically generate two fresh pseudo- 
random values: an index and a secret key (for the en- 
cryption scheme). The location information is encrypted 
using the secret key. The core sends both the index and 
the ciphertext to the remote storage. As before the ci- 
phertext reveals no information, but the index is pseudo- 
random as well, meaning the entire update is anonymous 
and unlinkable. Thus goal (1) is satisfied. Goal (4) is 
as well: the owner, having a copy of the original crypto- 
graphic seed, can recompute all of the indices and keys 
used. This allows for efficient search of the remote stor- 
age for his or her updates, using the indices. The indices 
do not reveal decryption keys nor past or future indices. 

This approach does not yet satisfy goal (2), because a 
thief — or customs official — can also use the seed to 
generate all the past indices and keys. We can rectify 
this by using a forward-secure pseudorandom genera- 
tor (FSPRG) [5]: instead of using a single cryptographic 
seed for the lifetime of the system, the core also evolves 
the seed pseudorandomly. When run, the core uses the 
FSPRG and the seed to generate an index, secret key, and 
a new seed. The old seed is discarded (securely erased). 
The properties of the FSPRG ensure that it is computa- 
tionally intractable to “go backwards” so that previous 
seeds (and the associated indices and keys) remain un- 
known even to a thief with access to the current seed. 
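
The seed evolution described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: HMAC-SHA256 from the Python standard library stands in for the FSPRG, and the function names, labels, and 16-byte truncation are hypothetical choices made for the example.

```python
import hashlib
import hmac
from typing import List, Tuple


def prf(seed: bytes, label: bytes) -> bytes:
    """Stand-in PRF: HMAC-SHA256 truncated to 16 bytes (an assumption,
    used here in place of a block-cipher-based generator)."""
    return hmac.new(seed, label, hashlib.sha256).digest()[:16]


def fsprg_step(seed: bytes) -> Tuple[bytes, bytes, bytes]:
    """Derive (index, key, next_seed) from the current seed."""
    return prf(seed, b"index"), prf(seed, b"key"), prf(seed, b"next-seed")


def owner_recompute(initial_seed: bytes, n: int) -> List[Tuple[bytes, bytes]]:
    """Owner side: replay the first n states from the stored initial seed."""
    outputs = []
    seed = initial_seed
    for _ in range(n):
        index, key, seed = fsprg_step(seed)
        outputs.append((index, key))
    return outputs


# Client side: only the current seed is kept between runs.
initial_seed = bytes(range(16))
client_seed = initial_seed
client_outputs = []
for _ in range(5):
    # Rebinding drops the previous seed; a real client would also have to
    # erase it securely from memory and disk, which Python cannot guarantee.
    index, key, client_seed = fsprg_step(client_seed)
    client_outputs.append((index, key))

# The owner, holding the initial seed, obtains the same indices and keys.
owner_outputs = owner_recompute(initial_seed, 5)
```

Because each output and each next seed is derived one-way from the current seed, someone who captures `client_seed` after five steps can compute future indices and keys but has no direct way to recover the earlier ones.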

Figure 2: (Left) The Adeona core, where E is a block cipher (e.g., AES) instantiating the FSPRG and Enc is a standard encryption 
scheme. (Right) Close-up of the core’s forward-private location caching, where the cache holds 3 updates and shown are two new 
locations being stored. 


Finally we can address goal (3) by randomly selecting 
times to send updates. Using the FSPRG as a 
source of randomness, we can pseudorandomly generate 
exponentially-distributed inter-update times. (This 
allows the owner to also recompute the inter-update 
times, which will be useful for retrieval as discussed in 
Section 4.) Such a distribution is memoryless, meaning 
that, from the storage provider’s view, the next update 
is equally likely to come from any client. We can tune 
the number of updates sent by adjusting the rate of the 
exponential distribution used. 
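
As a rough illustration of how a pseudorandom value can be turned into an exponentially distributed delay, the sketch below applies inverse-transform sampling: for a uniform $u \in [0,1)$ and rate $\lambda$, the delay is $-\ln(1-u)/\lambda$. The HMAC-SHA256 stand-in for the FSPRG, the function names, and the rate value are assumptions made for the example.

```python
import hashlib
import hmac
import math


def pseudorandom_unit(seed: bytes, label: bytes = b"delay") -> float:
    """Map PRF output to a float in [0, 1) using 53 bits, so the
    result is exactly representable and never rounds up to 1.0."""
    digest = hmac.new(seed, label, hashlib.sha256).digest()
    return (int.from_bytes(digest[:7], "big") >> 3) / 2**53


def inter_update_delay(seed: bytes, rate: float) -> float:
    """Inverse-transform sampling of an exponential distribution."""
    u = pseudorandom_unit(seed)
    return -math.log(1.0 - u) / rate


RATE_PER_HOUR = 1.0  # hypothetical: one update per hour on average
seed = b"\x2a" * 16
delay_hours = inter_update_delay(seed, RATE_PER_HOUR)
```

The delay is a deterministic function of the seed, so the owner can recompute it from his or her own copy, and raising `rate` shortens the delays proportionally.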


Forward-private location caching. Our pseudorandom 
update schedule means that we might miss locations that 
are visited for only a short amount of time. However, to 
provide maximal evidentiary forensic data about the tra- 
jectories of a device after theft, we would like the core 
to allow reporting all of the recently visited locations. 
We could cache recent locations, but this breaks forward- 
privacy. We therefore enhance the basic design to include 
a forward-private location cache. Having a cache also 
provides a simple mechanism for adding temporal redun- 
dancy to updates (i.e., location data is sent multiple times 
to the remote storage over time), which can increase the 
ability to successfully retrieve updates. 


Instead of just caching location data in the clear, we 
can have the core immediately encrypt new data sent 
from the location-finding module. The resulting cipher- 
text can then be added to a cache; the least recent ci- 
phertext is expelled. However, we cannot just utilize the 
encryption key generated by the current state’s FSPRG: a 
thief could decrypt any ciphertexts in the cache that were 
added since the last time the FSPRG seed was refreshed 
(e.g., when the previous update was sent). We therefore 
use a distinct FSPRG seed, which we call the cache seed, 
as the source for generating encryption keys for each lo- 
cation encountered. Each time the cache seed is used to 
encrypt new location data, it is also used to generate a 
new cache seed and the prior one is securely erased. In 
this way we guarantee forward privacy: no data in the 
core allows a thief to decrypt previously generated ciphertexts. 
When it is time to send an update, the entire 
cache is encrypted using the secret key generated by the 
FSPRG with the main seed. This (second) encryption 
ensures that the data stored at the remote storage cannot 
later be correlated with ciphertexts in the cache. Finally, 
the core “resets” the cache seed by generating a fresh 
one using the FSPRG and the main seed. This associates 
a sequence of cache seeds to a particular update state. 
We ensure freshness of location data by mandating that 
at least one newly generated ciphertext is included with 
each update submitted to the remote storage. 


The owner can reconstruct all of the cache seeds for 
any state (using the prior state’s main seed) and do trial 
decryption to recover locations. (The number of ex- 
pected trials is the number of locations visited in between 
two updates, and so this will be typically small.) Cipher- 
texts in the cache that are “leftover” from a prior update 
time period can also be decrypted, and this can be ren- 
dered efficient if plaintexts include a hint (i.e., the num- 
ber of states back) that specifies which state generated 
the keys for the next ciphertext entry. 
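
The caching and trial-decryption idea can be sketched as follows. This is a simplified illustration under loud assumptions: the "cipher" is a toy XOR keystream with a short tag built from HMAC-SHA256 (a stand-in for a real authenticated mode), the owner is handed the state's first cache seed directly rather than deriving it from the main seed, and the outer encryption of the whole cache is omitted. All names are hypothetical.

```python
import hashlib
import hmac
from collections import deque
from typing import List, Optional


def prf(key: bytes, label: bytes) -> bytes:
    return hmac.new(key, label, hashlib.sha256).digest()


def toy_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy authenticated encryption (illustration only, not secure practice)."""
    if len(plaintext) > 32:
        raise ValueError("toy cipher handles at most 32 bytes")
    body = bytes(p ^ s for p, s in zip(plaintext, prf(key, b"stream")))
    return body + prf(key, b"tag" + body)[:8]


def toy_decrypt(key: bytes, ciphertext: bytes) -> Optional[bytes]:
    body, tag = ciphertext[:-8], ciphertext[-8:]
    if not hmac.compare_digest(tag, prf(key, b"tag" + body)[:8]):
        return None  # wrong key: the tag does not verify
    return bytes(c ^ s for c, s in zip(body, prf(key, b"stream")))


class ForwardPrivateCache:
    def __init__(self, cache_seed: bytes, size: int = 3) -> None:
        self.seed = cache_seed
        self.slots = deque(maxlen=size)  # oldest ciphertext is expelled

    def add(self, location: bytes) -> None:
        key = prf(self.seed, b"enc-key")
        self.slots.append(toy_encrypt(key, location))
        self.seed = prf(self.seed, b"next-cache-seed")  # ratchet forward


def owner_recover(cache_seed: bytes, ciphertexts: List[bytes],
                  max_trials: int = 16) -> List[bytes]:
    """Regenerate the cache keys and try each one on every ciphertext."""
    keys, seed = [], cache_seed
    for _ in range(max_trials):
        keys.append(prf(seed, b"enc-key"))
        seed = prf(seed, b"next-cache-seed")
    found = []
    for ct in ciphertexts:
        for key in keys:
            pt = toy_decrypt(key, ct)
            if pt is not None:
                found.append(pt)
                break
    return found


first_cache_seed = b"\x01" * 16
cache = ForwardPrivateCache(first_cache_seed)
for loc in (b"10.0.0.5", b"192.0.2.17", b"198.51.100.9", b"203.0.113.4"):
    cache.add(loc)
recovered = owner_recover(first_cache_seed, list(cache.slots))
thief_view = owner_recover(cache.seed, list(cache.slots))
```

Here `recovered` holds the three most recent locations, while `thief_view`, computed from the seed left on the device, comes back empty because the ratcheted seed no longer yields the earlier keys.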


Implementing the design. Implementing the Adeona 
core is straightforward, given a block cipher® such as 
AES. A standard and provably secure FSPRG implementation 
based on AES works as follows [5]. A cryptographic 
seed is just an AES key (16 bytes). To generate a 
string of pseudorandom bits, one iteratively applies AES, 
keyed with a seed $s$, to a counter: $\mathrm{AES}(s,1)$, $\mathrm{AES}(s,2)$, 
etc. For Adeona, we have an initial main seed $s_1$ and initial 
cache seed $c_{1,1}$ (both randomly generated). The main 
seed $s_1$ is used to generate a new seed $s_2 = \mathrm{AES}(s_1,0)$, 
the next state’s cache seed $c_{2,1} = \mathrm{AES}(s_1,1)$, and so on 
for the encryption key, index, and time offset. (The exponentially 
distributed time offset is generated from a 
pseudorandom input using the well-known method of 
inverse-transform sampling [13].) A seed, after it is used, 
must be securely erased. The cache seed forms a separate 
branch of the FSPRG and is used to generate a sequence 
of cache seeds and intermediate encryption keys 
for use within the cache. Figure 2 provides a diagram 
of the core module’s operation between two successive 
updates at times $T_{i-1}$ and $T_i$. 

The encryption scheme can also be built using just 
AES, via an efficient block cipher mode such as 
GCM [26]. Such a mode also provides authenticity. Of 
added benefit is that the mode can be rendered determin- 
istic (i.e., no randomness needed) since we only encrypt 
a single message with each key. This means that the core 
(once initialized) does not require a source of true ran- 
domness. 


Summary. To summarize, the core uses a sequence of 
secret seeds $s_1, s_2, \ldots$ to provide 


• a sequence $I_1, I_2, \ldots$ of pseudorandom indices to store 
ciphertexts under, 


• sequences $c_{i,1}, c_{i,2}, \ldots$ of secret cache seeds for each 
state $i$ that are then used to encrypt data about each 
location visited, 


• a sequence $K_1, K_2, \ldots$ of secret keys for encrypting the 
cache before submission to the remote storage, and 


• a sequence $\delta_1, \delta_2, \ldots$ of pseudorandom inter-update 
times for scheduling updates 


while providing the following assurances. Given any $I_j$, 
$K_j$, or $\delta_j$, no adversary can (under reasonable assumptions) 
compute any of the other output values above. Additionally, 
even if the thief views the entire internal state 
of the core, it still cannot compute any of the core’s previously 
used indices, cache seeds, encryption keys, or 
inter-update times. 


4 The Adeona System: Private Tracking 
using OpenDHT 


A (privacy-preserving) tracking system consists of three 
main components: the device, the remote storage, and 
an owner. The device component itself consists of a 
location-finding component and a core component; other 
components — such as a camera image capture function- 
ality — can easily be incorporated. A system works in 
three phases: initialization, active use, and retrieval. We 
have already seen the Adeona core. In this section we 
show how to construct a complete privacy-preserving de- 
vice tracking system using it. 

Our target is to develop an open-source, immediately 
deployable system. This will allow evaluation of our 
techniques during real usage (see Section 6), not to men- 
tion providing to individual users an immediate (and, to 
our knowledge, first) alternative to the plethora of existing, 
proprietary tracking systems, none of which achieve 
the level of privacy that we target and that we believe will 
be important to many users. Along these lines, this section 
focuses on a model for an open-source software-only 
client. We use the public distributed storage infrastruc- 
ture OpenDHT [30] for the remote storage facility. Not 
only does this obviate the need to set up dedicated remote 
storage facilities, enabling immediate deployability, but 
it effectively removes our system’s reliance on any single 
trusted third party. This adds significantly to the practical 
privacy guarantees of the system. 

We now flesh out the design of the complete Adeona 
system. The client consists of the Adeona core of the previous 
section (with a few slight modifications) plus a 
location-finding module, both described below. 
First, however, we describe the other components: us- 
ing OpenDHT for remote storage and how to perform 
privacy-preserving retrieval. We conclude the section 
with a summary of the whole system. 


OpenDHT as remote storage. A distributed hash ta- 
ble (DHT) allows insertion and retrieval of data values 
based on hash keys. OpenDHT is an implementation of 
a DHT whose nodes run on PlanetLab [11]. 
We use the indices generated by the Adeona 
core as the hash keys and store the ciphertext data under 
them. There are several benefits to using a public, 
open-source DHT as remote storage. 
First, existing DHTs such as OpenDHT are already 
deployed and usable, meaning deployment of the tracking 
system only requires distribution of software for the 
client and for retrieval. Second, a DHT can naturally 
provide strengthened privacy and security guarantees 
because updates will be stored uniformly 
across all the nodes of the DHT. In decentralized DHTs, 
an attacker would have to corrupt a significant fraction of 
DHT nodes in order to mount denial-of-service or privacy 
attacks as the storage provider. 

On the other hand, DHTs also have limitations. The 
most fundamental is a lack of persistence guarantee: 
the DHT itself provides no assurance that inserted data 
can always be retrieved. Fortunately, OpenDHT ensures 
that inserted data is retained for at least a week.’ An- 
other limitation is temporary connectivity problems. Of- 
ten nodes, even in OpenDHT, can be difficult to access, 
meaning our client will not be able to send an update successfully. 
The traditional approach for handling such 
issues is to use client-side replication. This means that 
the client submits the same data to multiple, widely dis- 
tributed nodes in the DHT. 

We can enhance the Adeona core to include such a 
replication mechanism easily: have the core generate 
several indices (as opposed to just one) for each update. 
These indices, being pseudorandom already, will be distributed 
uniformly across the space of all DHT nodes. 
The update can then be submitted under all of these indices. 
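
A small sketch of this replication idea follows. It does not use the real OpenDHT interface: `MockDHT` is an in-memory stand-in, and the index derivation (HMAC-SHA256 with a per-copy label, truncated to 20 bytes), the node-selection rule, and the node count are all assumptions made for illustration.

```python
import hashlib
import hmac
from typing import Dict, List, Optional, Set


def derive_indices(seed: bytes, copies: int) -> List[bytes]:
    """Several pseudorandom indices for one update (hypothetical labels)."""
    return [
        hmac.new(seed, b"index" + bytes([i]), hashlib.sha256).digest()[:20]
        for i in range(copies)
    ]


class MockDHT:
    """In-memory stand-in for a DHT; nodes can be marked unreachable."""

    def __init__(self, num_nodes: int) -> None:
        self.nodes: List[Dict[bytes, bytes]] = [{} for _ in range(num_nodes)]
        self.down: Set[int] = set()

    def node_for(self, index: bytes) -> int:
        return int.from_bytes(index, "big") % len(self.nodes)

    def put(self, index: bytes, value: bytes) -> bool:
        node = self.node_for(index)
        if node in self.down:
            return False
        self.nodes[node][index] = value
        return True

    def get(self, index: bytes) -> Optional[bytes]:
        node = self.node_for(index)
        if node in self.down:
            return None
        return self.nodes[node].get(index)


def send_update(dht: MockDHT, seed: bytes, ciphertext: bytes, copies: int) -> int:
    """Store the same ciphertext under every index; return successful puts."""
    return sum(dht.put(i, ciphertext) for i in derive_indices(seed, copies))


def fetch_update(dht: MockDHT, seed: bytes, copies: int) -> Optional[bytes]:
    """Try each index in turn until one replica is reachable."""
    for index in derive_indices(seed, copies):
        value = dht.get(index)
        if value is not None:
            return value
    return None


dht = MockDHT(num_nodes=8)
seed = b"\x11" * 16
stored = send_update(dht, seed, b"ciphertext-bytes", copies=4)
first_node = dht.node_for(derive_indices(seed, 4)[0])
dht.down.add(first_node)  # simulate a temporary connectivity problem
result = fetch_update(dht, seed, copies=4)
```

Retrieval succeeds as long as at least one of the replicas landed on a node other than the unreachable one, which is the point of spreading copies over pseudorandom indices.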


Scheduling location updates. The Adeona core pro- 
vides a method to search for update ciphertexts via the 
deterministically generated indices. As noted, querying 
the remote storage for a set of indices does not reveal 
decryption keys or past or future indices. However, just 
the fact that a set of indices is queried might allow 
the remote storage provider to trivially associate 
them with the same device. While the distributed nature 
of OpenDHT mitigates this threat, defense-in-depth asks 
that we do better. We therefore want a mechanism that 
ensures the owner can precisely determine which indices 
to search for when performing queries, and in particu- 
lar allow him to avoid querying indices used before the 
device was lost or stolen. 


To enable this functionality, we have the system pre- 
cisely (but still pseudorandomly) schedule updates rela- 
tive to some clock. The clock could be provided, for ex- 
ample, by a remote time server that the client and owner 
can synchronize against. Then, when the owner initializes 
the client, in addition to picking the cryptographic 
seed it also stores the current time as the initial time 
stamp $T_1$. Each subsequent state also has a time stamp 
associated with it: $T_2$, $T_3$, etc. These indicate the state’s 
scheduled send time, and $T_{i+1}$ is computed by adding $T_i$ 
and $\delta_i$ (the pseudorandom inter-update delay). When the 
client is run, it reads the current time from the clock and 
iterates past states whose scheduled send times have already 
passed. (In this way the core will “catch up” the state 
to the schedule.) With access to a clock loosely synchro- 
nized against the client’s, the owner can accurately re- 
trieve updates sent at various times (e.g., last week’s up- 
dates, all the updates after the device went missing, etc.). 
We discuss the assumption of a clock more in our secu- 
rity analysis in Section 5. 
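
The schedule can be sketched as follows, with the client catching up to the current time and the owner selecting only the indices scheduled inside a time window. As before, this is an illustration under assumptions: HMAC-SHA256 replaces the FSPRG, times are plain seconds since initialization, and the names and the rate are hypothetical.

```python
import hashlib
import hmac
import math
from typing import List, Tuple

RATE = 1.0 / 3600.0  # hypothetical: one update per hour, in updates per second


def prf(seed: bytes, label: bytes) -> bytes:
    return hmac.new(seed, label, hashlib.sha256).digest()[:16]


def delay(seed: bytes, rate: float) -> float:
    """Exponentially distributed inter-update delay derived from the seed."""
    u = (int.from_bytes(prf(seed, b"delay")[:7], "big") >> 3) / 2**53
    return -math.log(1.0 - u) / rate


def next_state(seed: bytes, t: float, rate: float) -> Tuple[bytes, float]:
    """State i -> state i+1: new seed and T_{i+1} = T_i + delta_i."""
    return prf(seed, b"next-seed"), t + delay(seed, rate)


def catch_up(seed: bytes, t: float, now: float, rate: float) -> Tuple[bytes, float]:
    """Client side: skip states whose scheduled send time has passed."""
    while t < now:
        seed, t = next_state(seed, t, rate)
    return seed, t


def indices_in_window(initial_seed: bytes, t1: float, start: float,
                      end: float, rate: float) -> List[bytes]:
    """Owner side: indices of states scheduled within [start, end]."""
    found = []
    seed, t = initial_seed, t1
    while t <= end:
        if t >= start:
            found.append(prf(seed, b"index"))
        seed, t = next_state(seed, t, rate)
    return found


initial_seed = b"\x07" * 16
T1 = 0.0
now = 10 * 3600.0  # the client wakes up ten hours after initialization
client_seed, send_time = catch_up(initial_seed, T1, now, RATE)
wanted = indices_in_window(initial_seed, T1, now, send_time, RATE)
```

Because the owner replays the same deterministic schedule, `wanted` begins with the index the caught-up client will use next, and indices scheduled before `now` are never queried.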


Location-finding module. Our system works modu- 
larly with any known location finding technique (e.g., 
determining external IP address, trace routes to nearby 
routers, GPS, nearby 802.11 or GSM beacons, etc.). 
We implemented three different location-finding mech- 
anisms: light, medium, and full. The light mechanism 
just determines the internal IP address and the externally-facing 
IP address. (The latter is the IP address as reported by 
an external server.) The medium mechanism additionally 
performs traceroutes to 8 randomly-chosen PlanetLab 
nodes. These traceroutes provide additional information 
about the device’s current surrounding network 
topology. The full mechanism employs a protocol that 
adapts state-of-the-art geolocationing techniques to our 
setting. Here, geolocationing refers to determining (approximate) 
physical locations from network data. Traditional 
approaches utilize a distributed set of landmarks 
to actively probe a target [18]. These probes, combined 
with the knowledge of the physical locations of the landmarks, 
allow approximate geolocationing of the target. 
We flip this approach around, using the active-client na- 
ture of our setting to have the client itself find nearby 
passive landmarks. 


Concretely, we utilize Akamai [2] nodes as landmarks: 
they are numerous, widespread, and often co-located 
within ISPs (ensuring some node is usually very close 
to the device). Akamai is purported to have about 25 000 
hosts distributed across 69 countries [2]. In a one-time 
pre-processing step, we can enumerate as many of their 
nodes as possible and then apply an existing virtual net- 
work coordinate system, Vivaldi [12], to assign them co- 
ordinates. The location-finding module chooses several 
nodes randomly out of this set, probes them to obtain 
round-trip times, then uses these values and the nodes’ 
pre-computed virtual coordinates to determine the de- 
vice’s own virtual coordinates. Based on this, the module 
determines an additional set of landmarks that are close 
to it in virtual coordinate space and issues network mea- 
surements (pings and traceroutes) to these close land- 
marks. These measurements, in addition to the device’s 
current internal- and external-facing IP addresses, are 
submitted to the core module as the current location in- 
formation. After retrieval, this information can be used 
to geolocate the device, by potentially contacting the ISP 
hosting the edge routers. 
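
The coordinate step can be illustrated with a toy model. The sketch below is not the module's actual protocol: it uses a simplified, Vivaldi-style spring relaxation in two dimensions, treats round-trip times as exact Euclidean distances, and probes a fixed (rather than random) set of made-up landmarks so that the run is reproducible.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]


def estimate_coordinates(landmarks: Dict[str, Point], rtts: Dict[str, float],
                         rounds: int = 200, step: float = 0.25) -> Point:
    """Move an estimate until its distances to the probed landmarks
    match the measured round-trip times (spring relaxation)."""
    names = sorted(rtts)
    x = sum(landmarks[n][0] for n in names) / len(names)  # start at centroid
    y = sum(landmarks[n][1] for n in names) / len(names)
    for _ in range(rounds):
        for n in names:
            dx, dy = x - landmarks[n][0], y - landmarks[n][1]
            dist = math.hypot(dx, dy)
            if dist == 0.0:
                continue  # no direction to move along; skip this sample
            err = rtts[n] - dist  # positive: we are too close to the landmark
            x += step * err * dx / dist
            y += step * err * dy / dist
    return x, y


def closest_landmarks(landmarks: Dict[str, Point], position: Point,
                      k: int) -> List[str]:
    """Landmarks nearest to the estimate in virtual coordinate space."""
    return sorted(
        landmarks,
        key=lambda n: math.hypot(landmarks[n][0] - position[0],
                                 landmarks[n][1] - position[1]),
    )[:k]


# Hypothetical pre-computed landmark coordinates.
LANDMARKS = {"A": (0.0, 0.0), "B": (100.0, 0.0), "C": (0.0, 100.0),
             "D": (100.0, 100.0), "E": (25.0, 75.0), "F": (90.0, 10.0)}
TRUE_POSITION = (30.0, 70.0)  # unknown to the client; used to fake probes
rtts = {n: math.hypot(LANDMARKS[n][0] - TRUE_POSITION[0],
                      LANDMARKS[n][1] - TRUE_POSITION[1])
        for n in ("A", "B", "C", "D")}

estimate = estimate_coordinates(LANDMARKS, rtts)
nearest = closest_landmarks(LANDMARKS, estimate, k=2)
```

In this toy setting the estimate settles near the true position and the two nearest landmarks are the ones the module would then measure with pings and traceroutes; real round-trip times are noisy, so an actual deployment would need a more careful fit.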


Putting it all together. We describe the Adeona system 
in its entirety. A state of the client consists of the main 
cryptographic seed, the cache and its seed, and a time 
stamp. The main seed is used with an FSPRG to gener- 
ate values associated to each state: the DHT indices, an 
encryption key, and an inter-update time. It also gener- 
ates the next state’s main seed and the next state’s cache 
seed. The time stamp represents the time at which the 
current state should be used to send location information 
to the remote storage. 


• (Initialization) The owner initializes the client by 
choosing random seeds and recording the time of initialization 
as the first state’s time stamp. The cache is 
filled with random bits. 


• (Active use) The main loop of the client proceeds as 
follows. The client, when executed, reads the current 
state and retrieves the current time (from, for example, 
the system clock). The client then transitions forward 
to the state that should be used to send the next update, 
based on the current time and the states’ scheduled 
send times. The location cache uses its seed to appropriately 
encrypt each new location update received 
from the location module. At the scheduled send time, 
the main seed is used to generate several indices and 
an encryption key. The latter is used to encrypt the entire 
cache. The result is inserted into OpenDHT under 
each index. The client then transitions to the next state. 
This means generating the next state’s seed, the next 
state’s cache seed, and the scheduled update time (the 
sum of the current update time and the inter-update 
delay). The old state data, except the cache, is erased. 


• (Retrieval) To perform retrieval, the owner can use his 
or her copy of the initial state to recompute the sequence 
of states, their scheduled send times, and their 
associated indices and keys. From this information, 
he or she can determine the appropriate indices to 
search the remote storage (being careful to avoid indices 
from before the device went missing). After retrieving 
the caches, the owner can decrypt as described 
in Section 3. 


5 Security Analysis 


The Adeona system is designed to ensure location pri- 
vacy, while retaining as much as possible the tracking 
abilities of solutions that provide weaker or no privacy 
properties. While we discuss other security evaluations 
and challenges inline in other sections, we treat here sev- 
eral key issues. 


Location privacy. We discuss privacy first. We assume 
a privacy set of at least two participating devices, and 
do not consider omniscient adversaries that, in particu- 
lar, can observe traffic at all locations visited by the de- 
vice. (Such a powerful adversary can trivially compro- 
mise location privacy, assuming the device uses a persis- 
tent hardware MAC address.) The goal of adversaries is 
to use the Adeona system to learn more than their a pri- 
ori knowledge about some device’s visited locations. Be- 
cause updates are anonymous and unlinkable, outsiders 
that see update traffic and the storage provider will not 
be able to associate the update to a device. The storage 
provider might associate updates that are later retrieved 
by the owner. This does not reveal anything about other 
updates sent by the owner’s device. The randomized 
schedule obscures timing-related information that might 
otherwise reveal which device is communicating an up- 
date. Note also that the landmarks probed in our geolo- 
cationing module only learn that some device is prob- 
ing them from an IP address. The thief cannot break the 
owner’s location privacy due to our forward privacy guar- 
antees. 

Outsiders and the storage provider do learn that some 
device is at a certain location at a specific time (but not 
which device). Also, the number of devices currently us- 
ing the system can be approximately determined (based 
on the rate of updates received), which could, for example, 
reveal a rough estimate of the number of devices 
behind a shared IP address. Moreover, these adversaries 
might attempt active attacks. For example, upon 
seeing an incoming update, the provider could immediately 
try to fingerprint the source IP address [24]. Distributing 
the remote storage as with OpenDHT naturally 
makes such an attack more difficult to mount. There 
are also known preventative measures that mitigate a de- 
vice’s vulnerability to such attacks [24]. Finally, all of 
this could be protected against by sending updates via a 
system like Tor [14] (in deployment settings that would 
allow its use), which obfuscates the source IP address. 
See Section 8.4. 

We remark that custom settings for Adeona’s various 
parameters might reduce a device’s privacy set. For example, 
if a client utilizes a cache size distinct from others, 
then this will serve to differentiate that client’s updates. 
Likewise if a client submits more (or fewer) copies 
of each update to the remote storage, then the storage 
provider or outsiders might be able to differentiate its up- 
dates from those of other devices. Finally, a rate parame- 
ter significantly different from other clients’ could allow 
tracking of the device. 


Device tracking. We now discuss the goal of device 
tracking, which just means a system’s ability to en- 
sure updates about a missing device are retrieved by the 
owner. As mentioned previously, the goal here is for 
Adeona to engender the same tracking functionality as 
systems with weaker (or no) privacy guarantees. We 
therefore do not consider attacks which would also dis- 
able a normal tracking system: disabling the client, cut- 
ting off Internet access, destroying the device, etc. (Ex- 
isting approaches to mitigating these attacks, like clever 
software engineering and/or hardware or BIOS support, 
are also applicable to our designs.) Nevertheless, Adeona 
as described in the previous section does have some lim- 
itations in this regard. 


• OpenDHT does not provide everlasting persistence.
This means that tracking fails for location updates 
more than a week old. Note that the location cache 
mechanism can be used to extend this time period. 
An alternate remote storage facility could also be used 
(see Section 7). 


• Adeona schedules its updates at random times. If the
device has Internet access for only a short time, this 
means that Adeona could miss a chance to send its 
update. We can trivially mitigate this by increasing 
the rate of our exponentially-distributed inter-update 
times (i.e., increase the frequency of updates), but at 
the cost of efficiency since this would mean sending 
more updates. 


• The absolute privacy of retrieval relies on the device
having a clock that the owner is loosely synchronized
against. The client relies on the system clock
to schedule updates. The thief could abuse this by,
for example, forcing the device’s system clock to not
progress. In the current implementation this would 
disrupt sending updates. Solutions for this are dis- 
cussed in Section 8.1. 


• Adeona relies on a stored state, and a thief could disable
Adeona by tampering with it. For example, flipping
even a single bit of the state will make all future
updates unrecoverable. To ensure that the thief has to 
disable the client itself (and not just modify its state) 
we can use a tamper-evident FSPRG in conjunction 
with a “panic” mode of operation. See Sections 8.2 
and 8.3. 


For some of these bullets, we recall that many thieves 
will be unsophisticated. Therefore, in the common case 
the likelihood of the above attacks is small. (And, indeed,
a sophisticated attacker could also compromise the
tracking functionality of existing commercial, central- 
ized alternatives.) 

We also briefly mention that Adeona, like existing 
tracking systems, might not compose with some other 
mobile device security tools. For example, using a secure 
full-disk encryption system could render all software on 
the system unusable, including tracking software. We 
leave the question of how to securely combine tracking 
with other security mechanisms to future work. 

Finally, while not a primary goal of our design, it turns 
out that Adeona’s privacy mechanisms can actually im- 
prove tracking functionality. For one, the authentication 
of updates provided by our encryption mode means the 
owner knows that any received update was sent using 
the keys on the device, preventing in-transit tampering 
by outsiders or the storage provider. That updates are 
anonymous makes targeted Denial-of-Service attacks — 
in which the storage provider or an outsider attempts to 
selectively block or destroy an individual’s updates — 
exceedingly difficult, if not impossible. 


6 Implementation and Evaluation 


To investigate the efficiency and practicality of our 
system, we have implemented several versions of the 
Adeona system as user-land applications for both Linux 
and Mac OS X. In all the versions, we used AES to im- 
plement the FSPRG. Encryption was performed using 
AES in counter mode and HMAC-SHA1 [3] in a stan- 
dard Encrypt-then-MAC mode [4]. The OpenSSL crypto 
library⁸ provided implementations of these primitives.
We note that HMAC was used for convenience only; 
an implementation using AES for message authentica- 
tion would also be straightforward. The rpcgen compiler 
was used to generate the client-side stubs for OpenDHT’s 
put-get interface over the Sun RPC protocol. We also 
used Perl scripts to facilitate installation. We focus on 
three main versions.


• adeona-0.2.1 implements the core functionality described
in Sections 3 and 4. It uses the medium
location-finding module of Section 4. The source
code for adeona-0.2.1, not including the libraries mentioned
above, consists of 7091 lines of unoptimized C
code. (The count includes comments and blank lines, i.e.,
it was calculated via wc -l *.[ch].) This version is being
readied for public release.


• adeona-0.2.0 is a slightly earlier version of
adeona-0.2.1 that differs in that it uses a simpler version of
the forward-private location cache. Its cache only han- 
dles locations observed during scheduled updates (as 
opposed to more frequent checks for a change in lo- 
cation, meaning that locations could be missed if ill- 
timed). The source code for adeona-0.2.0 consists of 
5231 lines of unoptimized C code. This version was 
deployed in the field trial described in Section 6.3. 


• adeona-0.1 uses the same ciphertext cache mechanism
as adeona-0.2.0, and additionally includes
the tamper-evident FSPRG that will be described in 
Section 8.2, the panic mode that will be described in 
Section 8.3, and the full location-finding mechanism 
described in Section 4. The tamper-evident FSPRG 
is implemented using the signature scheme associated 
to the Boneh-Boyen identity-based encryption (IBE) 
scheme [7] and the anonymous IBE scheme is imple- 
mented using Boneh-Franklin [8] in a hybrid mode 
with the Encrypt-then-MAC scheme described above. 
The two schemes rely on the same underlying elliptic 
curves that admit efficiently computable bilinear pair- 
ings. It relies on the Stanford Pairings-Based Crypto 
(PBC) library version 0.4.11 [25] and specifically the 
“Type F” pairings. Not counting the PBC library, this 
version is implemented in 9 723 lines of C code. 


The oldest version was mainly for experimenting with 
the extensions discussed in Section 8 and the new geolo- 
cation technique discussed in Section 4, while the newer 
two versions were largely re-writes to prepare for public 
use. The source code for any version is directly available 
from the authors. 


6.1 Performance 


We ran several benchmarks to gauge the performance of 
our design mechanisms. The system hosting the experi- 
ments was a dual-core 3.20 GHz Intel Pentium 4 proces- 
sor with 1GB of RAM. It was connected to the Internet 
via a university network. 


Basic network operations. Table 1 gives the wall-clock
time in milliseconds (calculated via the gettimeofday system
call) to perform each basic network operation: an
OpenDHT put of a 1024-byte payload, an OpenDHT
              Min     Mean   Median      Max   T/O
Put           207     1021      470    11463     2
Get             2      240       77    11238     3
Loc medium   5642    13270    15531    30381     -
Loc full    17446    36802    36197    63916     -

Table 1: Wall-clock time in milliseconds per operation to perform
basic network operations: DHT put, DHT get, a medium
location-finding operation, and a full location-finding operation.









































adeona-0.2.1              r=0     r=10     r=100
Owner state                75       75        75
Client state (light)       75      876      8076
Update (light)             36      400      4000
Client state (medium)      75    27116    270476
Update (medium)          1348    13320    135200

adeona-0.1                r=0     r=10     r=100
Owner state              3544     3545      3548
Client state (full)      1779    30824    292184
Update (full)            1452    14520    145200

Table 2: Typical sizes in bytes of state and update data used by
adeona-0.2.1 and adeona-0.1 on a 32-bit system, for different
sizes of the ciphertext cache specified by r.


get of a 1024-byte payload, the time to do the 8 tracer- 
outes used in the medium location-finding mechanism, 
and the time to do the full location-finding operation (as 
described in Section 4). Each operation was performed 
100 times; shown is the min/mean/median/max time over 
the successful trials. The number of time outs (failures) 
for the put trials and get trials are shown in the column 
labeled T/O. The time out for OpenDHT RPC calls was 
set to 15 seconds in the implementation. For the location 
mechanisms, hop timeouts for traceroutes and timeouts 
for pings were set to 2 seconds (here an individual probe 
time out does not signify failure of the operation). 


Space utilization. Table 2 details the space requirements 
in bytes of adeona-0.2.1 (adeona-0.2.0 has equivalent 
sizes) with light and medium location mechanisms and 
adeona-0.1 with the full location mechanism. Here, and 
below, the parameter r specifies the size of the ciphertext
cache used. When r = 0 this means that no cache
was used (only the current location is inserted during an 
update). For ease-of-use (i.e., so one can print out or 
copy down state information) we encoded all persistently 
stored data in hex, meaning the sizes of stored state are 
roughly twice as large as absolutely necessary. The use
of asymmetric primitives by adeona-0.1 for the tamper-evident
FSPRG functionality and the IBE scheme accounts
for its larger space utilization.
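
The factor of two from hex encoding can be seen in a minimal sketch.
The 16-byte value below is only an illustrative stand-in for a piece of
stored state; it is not Adeona's actual state layout.

```python
import os

# Hypothetical 16-byte piece of state (e.g. one AES-block-sized seed).
raw_state = os.urandom(16)

# Hex encoding yields a printable string with two characters per byte.
hex_state = raw_state.hex()

# The encoding is lossless: decoding recovers the original bytes.
assert bytes.fromhex(hex_state) == raw_state

print(len(raw_state), len(hex_state))
```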


Microbenchmarks. Space constraints limit the amount 
of data we can report, and so our focus here is on adeona- 
0.1. It uses more expensive cryptographic primitives (el- 
liptic curves supporting bilinear pairings), and we want 
to assess whether the extensions relying on them hinder
performance significantly. Table 3 gives running times 
in milliseconds/operation for the basic operations used 
by adeona-0.1. (We omit the times for non-panic en- 
cryption, decryption, update, and retrieve; these times 
were at most those of the related panic-mode opera- 
tions.) These benchmarks only used the light location- 
finding mechanism and each update was inserted to a 
single OpenDHT node. Each operation was timed for 
100 repetitions both using the clock system call (the CPU 
columns) and gettimeofday (the Wall columns); shown is 
the min/mean/median/max time over the successful tri- 
als. Where applicable, the number of time outs (due to 
DHT operations) are shown in the column labeled T/O. 
Note that the retrieve operations only include retrieval 
for a single update. These benchmarks show that the ex- 
tensions are not prohibitive: performance is dependent 
almost entirely on the speed of network operations. 


6.2 Geolocation accuracy 


As mentioned earlier, our system has been designed to 
convey various kinds of location information to the stor- 
age service. We then rely on previously proposed net- 
work measurement analysis techniques and/or database 
lookups to process the stored location information and 
derive a geographical estimate. The strengths and weak- 
nesses of such techniques are well-documented. We 
therefore focus our evaluation on the active client-based 
measurement technique described in Section 4 that at- 
tempts to identify a set of nearby passive landmarks 
given a large number of geographically distributed land- 
marks. 

First, we accumulated about 225412 open DNS 
servers by querying Internet search engines for dictio- 
nary words and collecting the DNS servers which re- 
sponded to lookups on the resulting hostnames. Next, 
we enumerated 8 643 Akamai nodes across the world by 
querying the DNS servers for the IP addresses of host- 
names known to resolve to Akamai edge servers (e.g., 
www.nba.com). Finally, 50 PlanetLab [11] nodes were 
used as stand-ins for lost or stolen devices across the 
United States. 

Having both targets and landmarks, we obtained 
round-trip time (RTT) measurements from the Planet- 
Lab nodes to the passive Akamai servers. The PlanetLab 
nodes were able to obtain measurements to 6 200 of the 
Akamai servers on our list. We could then evaluate our 
geolocation technique by running it over these measure- 
ments. Figure 3 plots the cumulative distributions of our 
results and the RTT to the actual closest Akamai node. 
We also plot the cumulative distribution of the RTT to 
an Akamai node as given by a simple DNS lookup for 
32 of our 50 targets (the other 18 nodes went down at 
                               CPU                            Wall
Operation              Min  Mean  Median  Max     Min   Mean  Median     Max   T/O
Initialize core        210   329     330  460     215    367     348    1082     -
Verify FSPRG state     340   456     470  610     351    494     474    1240     -
Panic encryption        90    95      90  110      93    101      95     207     -
Panic decryption        80    90      90  100      85    104      90     934     -
Panic update   r=0     440   559     570  700     612   1653     977   15347     9
               r=10    440   543     545  680     818   2289    1311   20582    10
               r=100   540   666     690  800    2953  12599    7439  165950     5
Panic retrieve r=0      80    89      90  100      92    499     207   12003     7
               r=10     80    87      90  100      93    705     335    9734    12
               r=100    80    87      90  100     116   2458    1555   21734     5

Table 3: Time in milliseconds to perform basic operations in adeona-0.1.


the time of measurement). Our technique performs bet- 
ter than Akamai’s own content delivery algorithms for 
more than 60% of the targets we considered. In addition,
we observe that it can find an Akamai server at
most 7 milliseconds away. 


6.3 Field trial 


We conducted a small field trial to gain experience 
with our implementation of Adeona, reveal potential is- 
sues with our designs, and quantitatively gauge the ef- 
ficacy of using OpenDHT as a remote storage facility. 
There were 11 participants each running the adeona- 
0.2.0 client with the same options: update rate param- 
eter of 0.002 (about 7 updates an hour on average), lo- 
cation cache of size r = 4, and spatial replication of 4 
(the core tries to insert each update to 4 DHT keys). The 
clients were instrumented to locally log all the location 
updates submitted over the course of the trial. At the end 
of the trial, we collected these client-side log files plus 
each owner’s copy of the initial state, and used this data 
to attempt to retrieve a week’s worth of updates⁹ for each
of the participants. 
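
As a rough check on the rate parameter above, the following minimal
sketch assumes the rate is expressed in updates per second and that
inter-update times are exponentially distributed. It draws times from
Python's random module purely for illustration; the function names are
hypothetical, and this is not how the Adeona client derives its
schedule.

```python
import random
from typing import List


def expected_updates_per_hour(rate_per_second: float) -> float:
    """Mean number of updates per hour for exponential inter-update times."""
    return rate_per_second * 3600.0


def sample_inter_update_times(rate_per_second: float, n: int,
                              seed: int = 0) -> List[float]:
    """Draw n exponentially distributed waiting times (in seconds)."""
    rng = random.Random(seed)
    return [rng.expovariate(rate_per_second) for _ in range(n)]


if __name__ == "__main__":
    # A rate of 0.002 per second gives roughly 7 updates an hour,
    # i.e. a mean wait of about 500 seconds between updates.
    print(round(expected_updates_per_hour(0.002), 1))
    waits = sample_inter_update_times(0.002, 20000)
    print(round(sum(waits) / len(waits)))
```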

Results are shown in the right-hand table of Figure 3. Here ‘#
Inserts’ refers to the total number of successful insertions 
into the DHT by the client in the week period. The ‘In- 
sert rate’ column measures the fraction of these inserts 
that were retrieved. The ‘# Updates’ column shows the 
total number of updates submitted by each client. Note 
that our replication mechanism means that each update 
causes the client to attempt 4 insertions of the location 
cache. The ‘Update rate’ column measures the percent- 
age of location caches retrieved. As can be seen, this 
fraction is usually larger than the fraction of inserts re- 
trieved, suggesting that replication across multiple DHT 
keys is beneficial. The ‘Locations Found’ column reports
the number of unique locations (each defined as a distinct
(internal IP, external IP) pair) found during retrieval versus
reported. The final column measures the time, in min- 
utes and seconds, that it took to perform retrieval for the 
user’s updates for the whole week (note that we parallelized
retrieval for each user).

We observed that OpenDHT may return “no data” for 
a key even when, in fact, there is data stored under that 
key. (This was detected when doing multiple get requests 
for a key.) Indeed, the failure to find two of the user loca- 
tions was due to this phenomenon, and in fact repeating 
the retrieval operation found these locations as well. 
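
One simple way to cope with such spurious “no data” responses is to
repeat the get a few times before giving up. The sketch below is a
minimal illustration under stated assumptions: the get argument is a
hypothetical callable that returns None when no data is found. It is
not OpenDHT's actual client interface.

```python
from typing import Callable, Optional


def get_with_retries(get: Callable[[bytes], Optional[bytes]],
                     key: bytes,
                     attempts: int = 3) -> Optional[bytes]:
    """Call get(key) up to `attempts` times; return the first non-None value."""
    for _ in range(attempts):
        value = get(key)
        if value is not None:
            return value
    return None
```

The trade-off is extra retrieval time for keys that really are empty,
since each such key now costs up to `attempts` requests.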


7 Deployment Settings: Hardware Support and Dedicated Servers


In Section 2, we highlighted several settings for de- 
vice tracking: internal corporate systems, third-party 
companies offering tracking services, and community- 
supported tracking for individuals. Each case differs in 
terms of what resources are available to both the tracking 
client and the remote storage. In Section 4 we built the 
Adeona system targeting a software client and OpenDHT 
repository, which works well for the third setting. Here 
we describe how our designs can work with other deploy- 
ment scenarios. 

A hardware-supported client can be deployed in sev- 
eral ways, including ASICs implementing client logic, 
trusted hardware modules (e.g., a TPM [35] or Intel’s 
Active Management technology), or worked into exist- 
ing system firmware components (e.g., a system BIOS). 
Hardware support can be effectively used to ensure that
the tracking client can only be disabled by the most so- 
phisticated thieves and, possibly, that the client has ac- 
cess to a private and tamper-free state. However, target- 
ing a system for use with a hardware-supported client 
adds to system requirements. While we do not work 
out all the (important) details of a hardware implemen- 
tation of Adeona’s client (leaving this to future work), 
we argue here that our techniques are amenable to this
type of deployment. Adeona’s core (Section 3) is partic- 
ularly suited for implementation in hardware. It relies on 
a single cryptographic primitive, AES, which is highly 
optimized for both software and hardware. For example, 
recent research shows how to implement AES in only 
[Figure 3, left: plot of cumulative distributions. x-axis: Shortest RTT
to Akamai node (ms). Curves: actual, adeona, akamai.]

User   # Inserts   Insert Rate   # Updates   Update Rate   Locations Found   Retrieve Time
01        491         0.89          251         0.94           11/12           12m 06s
02        632         0.89          327         0.94            3/3            16m 04s
03        622         0.84          321         0.91            2/2            17m 04s
04        543         0.87          274         0.95            5/5            15m 03s
05        617         0.88          309         0.96            4/4            19m 04s
06        234         0.85          123         0.90            4/4            15m 06s
07        359         0.89          199         0.95            5/6            18m 04s
08        420         0.85          220         0.92            7/7            14m 06s
09        504         0.91          259         0.97            1/1            11m 06s
10        138         0.90           59         0.92            4/4            13m 04s
11        302         0.81          175         0.91            6/6            14m 04s

Figure 3: (Left) The cumulative distribution of the shortest RTT (in milliseconds) to an Akamai node found by Adeona compared
to the actual closest Akamai node and Akamai’s own content delivery algorithm. (Right) Field trial retrieval rates and retrieval
times (in minutes and seconds).


3400 gates (on a “grain of sand”) [15]. In its most basic 
form (without a location cache), the core only requires 
16 bytes of persistent storage. 


In settings where a third-party company offers track- 
ing services, the design requirements are more relaxed 
compared to a community-supported approach. Partic- 
ularly, such a company would typically offer dedicated 
remote storage servers. This would allow handling per- 
sistence issues server-side. Further, this kind of remote 
storage service is likely to provide better availability than 
DHTs, obviating the need to engineer the client to handle 
various kinds of service failures. Adeona is thus slightly 
over-engineered for this setting (e.g., we could dispense 
with the replication technique of Section 4). An interest- 
ing question that is raised in such a deployment setting is 
how to perform privacy-preserving access control. For 
obvious reasons, these remote storage facilities would 
want to restrict the parties able to insert data. If we 
use traditional authentication mechanisms, the authen- 
tication tag might reveal who is submitting the update. 
Thus one might think about using newer cryptographic 
primitives such as group or ring signatures [10, 31] that 
allow authentication while not revealing what member of 
a group is actually communicating the update. 


Corporations or other large organizations might opt 
to internally host storage facilities, as per Scenario 5 of 
Section 2. Again, dedicated storage servers ease design 
constraints, meaning Adeona can be simplified for this 
setting. But there is again the issue of access control. 
(Though in this setting existing corporate VPNs, if these
do not reveal the client’s identity, might be used.) On the 
other hand, this kind of deployment raises other interest- 
ing questions. Particularly, the privacy set is necessarily 
restricted to only employees of the corporation, and so an 
adversary might be able to aggregate information about 
overall employee location habits even if the adversary 
cannot track individual employees.


8 Extensions 


We describe several extensions for the Adeona system 
that highlight its versatility and extensibility. These in- 
clude: removing the reliance on synchronized clocks, 
tamper-evident FSPRGs for untrusted local storage, a 
panic mode of operation that does not rely on state, the 
use of anonymous channels, and enabling communica- 
tion from an owner back to a lost device. 


8.1 Avoiding synchronized time 


The Adeona system, as described in Section 4, utilizes a 
shared clock between the client and owner to ensure safe 
retrieval. This is realized straightforwardly if the client is 
loosely synchronized against an external clock (e.g., via 
NTP [27]). In deployment scenarios where the device 
cannot be guaranteed to maintain synchronization or the 
thief might maliciously modify the system clock, we can 
modify the client and retrieval process as follows. 
Whenever the client is executed, it reads the current 
state (which is now just the current cryptographic seed 
for the FSPRG and the cache) and computes the inter-update
time δ associated to the state. It then waits that
amount of time before sending the next update and pro- 
gressing the state. For retrieval, the owner can still com- 
pute all of the inter-update times, and use these to esti- 
mate when a state was used to send an update. If the 
client runs continuously from initialization, then a state 
will be used when predicted by the sum of the inter- 
update times of earlier states. If the client is not run 
continuously from initialization, then a state might be 
used to send an update later (relative to absolute time)
than predicted by the inter-update times. It is therefore
privacy-preserving for the owner to retrieve any states estimated
to be sent after the time at which the device was
lost. The owner might also query prior states to search
for relevant updates, while being careful not to go too far
back (lest he begin querying for updates sent before the
device was lost).
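
The owner-side estimate described above can be sketched as follows.
This is a minimal illustration under stated assumptions: SHA-256 is
swapped in for the AES-based FSPRG, the derivation of the inter-update
time from a seed is invented for the example, and all function names
are hypothetical.

```python
import hashlib
import math
from typing import List


def next_seed(seed: bytes) -> bytes:
    """Advance the generator one step (SHA-256 stand-in for the FSPRG)."""
    return hashlib.sha256(b"next" + seed).digest()


def inter_update_time(seed: bytes, rate: float) -> float:
    """Derive an exponentially distributed wait (seconds) from a seed."""
    digest = hashlib.sha256(b"delta" + seed).digest()
    u = (int.from_bytes(digest[:8], "big") + 1) / 2**64  # u lies in (0, 1]
    return -math.log(u) / rate


def states_to_query(initial_seed: bytes, init_time: float, lost_time: float,
                    now: float, rate: float) -> List[int]:
    """Indices of states predicted to be used in [lost_time, now].

    Without a shared clock, a state can only be used at or after its
    predicted time, so every index returned here corresponds to an
    update sent after the device was lost.
    """
    indices = []
    seed, predicted, i = initial_seed, init_time, 0
    while True:
        predicted += inter_update_time(seed, rate)  # earliest use of state i
        if predicted > now:
            break
        if predicted >= lost_time:
            indices.append(i)
        seed = next_seed(seed)
        i += 1
    return indices
```

Querying somewhat earlier indices as well would correspond to the
owner's optional search of prior states, at the privacy risk noted
above.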


8.2 Detection of client state tampering 


The Adeona system relies on the client’s state remain- 
ing unmodified. Compared to a (hypothetical) stateless 
client, this allows a new avenue for disabling the de- 
vice. To rectify this disparity between the ideal (in which 
an adversary has to disable/modify the client executable) 
and Adeona we design a novel cryptographic primitive, 
a tamper-evident FSPRG, that allows cryptographic val- 
idation of state. By adding this functionality to Adeona 
we remove this avenue for disabling tracking functional- 
ity. Moreover, we believe that tamper-evident FSPRGs 
are likely of independent interest and might find use in 
further contexts where untrusted storage is in use, e.g., 
when the Adeona core is implemented in hardware but 
the state is stored in memory accessible to an adversary. 

A straightforward construction would work as follows. 
The owner, during initialization, also generates a signing 
key and a verification key for a digital signature scheme. 
Then, the initialization routine generates the core’s values
s_i, c_{i,1}, T_i for each future state i that could be used by
the client, and signs these values. The verification key 
and resulting signatures are placed in the client’s stor- 
age, along with the normal initial state. The client, to 
validate lack of tampering with FSPRG states, can verify 
the state’s s_i, c_{i,1}, T_i values via the digital signature’s verification
algorithm and the (stored) verification key. Unfortunately,
this approach requires a large amount of storage
(linear in the number of updates that could be sent).
Moreover, a very sophisticated thief could just mount a 
replacement attack: substitute his or her own state, ver- 
ification key, and signatures for the owner’s. Note this 
attack does not require modifying or otherwise interfer- 
ing with the client executable. We can do better on both 
accounts. 

To stop replacement attacks, we can use a trusted au- 
thority to generate a certificate for the owner’s verifica- 
tion key (which should also tie it to the device). Then the 
trusted authority’s verification key can be hard-coded in 
the client executable and be used to validate the owner’s 
(stored) verification key. To reduce the storage space 
required, we have the owner, during initialization, only 
sign the final state’s values. To verify, the client can seek 
forward with the FSPRG (without yet deleting the cur- 
rent state) to the final state and then verify the signature. 
(A counter can be used to denote how many states the 
client needs to progress to reach the final one.) Under 
reasonable assumptions regarding the FSPRG (in particular,
that it’s difficult to find two distinct states that lead
to the same future state) and the assumption that the dig- 
ital signature scheme is secure, no adversary will be able 
to generate a state that deviates from the normal progres- 
sion, yet verifies. A clever thief might try to roll the 
FSPRG forward in the normal progression, to cause a 
long wait before the next update. This can be defended 
against with a straightforward check relative to the cur- 
rent time: if the next update is too far away, then assume 
adversarial modification. We could also store the signa- 
tures of some fraction of the intermediate states in or- 
der to operate at different points of a space/computation 
trade-off. 
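
The seek-forward check can be sketched as below. This is only an
illustration under stated assumptions: SHA-256 is swapped in for the
FSPRG, and signature verification is abstracted as a caller-supplied
verify_final callback (the demo uses a plain comparison against the
known final state as a stand-in for verifying the owner's signature).
All names are hypothetical.

```python
import hashlib
from typing import Callable


def next_state(state: bytes) -> bytes:
    """Advance the generator one step (SHA-256 stand-in for the FSPRG)."""
    return hashlib.sha256(b"fsprg-next" + state).digest()


def seek_forward(state: bytes, steps: int) -> bytes:
    """Compute the state reached after `steps` advances, keeping the input."""
    for _ in range(steps):
        state = next_state(state)
    return state


def state_is_valid(state: bytes, steps_remaining: int,
                   verify_final: Callable[[bytes], bool]) -> bool:
    """Seek to the final state and check it against the owner's signature.

    verify_final stands in for signature verification over the final
    state's values using the stored, certified verification key.
    """
    if steps_remaining < 0:
        return False
    return verify_final(seek_forward(state, steps_remaining))


if __name__ == "__main__":
    initial, total = b"\x00" * 32, 10
    final = seek_forward(initial, total)
    verify = lambda s: s == final          # stand-in for signature check
    current = seek_forward(initial, 3)     # client has sent 3 updates
    tampered = bytes([current[0] ^ 1]) + current[1:]
    print(state_is_valid(current, total - 3, verify))
    print(state_is_valid(tampered, total - 3, verify))
```

Note that this check alone does not catch a thief who rolls the state
forward and adjusts the counter to match; that case needs the separate
check against the current time described above.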


8.3 Private updates with tampered state 


If the client detects tampering with the state, then it can 
enter into a “panic” mode which does not rely on the 
stored state to send updates. One might have panic mode 
just send updates in the clear (because these locations 
are presumably not associated with the owner), but there 
can be reasons not to do this. For example, configuration 
errors by an owner could mistakenly invoke panic mode. 

Panic mode can still provide some protection for updates
without relying on shared state, as follows. We assume
the client and owner have access to an immutable,
unique identification string ID. In practice this ID could
be the laptop’s MAC address. We also use a cryptographic
hash function H: {0,1}* → {0,1}^h, such as
SHA-256 for which h = 256 bits. Then pick a parameter
b ∈ [0..h]. For each update, the client generates a
sequence of indexes via I_1 = H(1 || T || H(ID)|_b), I_2 =
H(2 || T || H(ID)|_b), etc. Here T is the current date and
H(ID)|_b denotes hashing ID and then taking the first b
bits of the result. (Varying the parameter b enables a
simple time-privacy trade-off known as “bucketization”.)
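
The index computation above can be sketched with SHA-256 as H. The
byte-level encodings here (a 4-byte big-endian counter, an ISO-format
date string for T, and zero-padding of the truncated hash to whole
bytes) are assumptions made for the example; the text does not specify
them, and the function names are hypothetical.

```python
import hashlib
from datetime import date
from typing import List


def truncate_bits(data: bytes, b: int) -> bytes:
    """Keep the first b bits of data, zeroing unused bits of the last byte."""
    nbytes = (b + 7) // 8
    out = bytearray(data[:nbytes])
    if b % 8:
        out[-1] &= (0xFF << (8 - b % 8)) & 0xFF
    return bytes(out)


def panic_indices(device_id: bytes, day: date, b: int,
                  count: int) -> List[bytes]:
    """Compute I_j = H(j || T || H(ID)|_b) for j = 1..count."""
    bucket = truncate_bits(hashlib.sha256(device_id).digest(), b)
    t = day.isoformat().encode()
    return [
        hashlib.sha256(j.to_bytes(4, "big") + t + bucket).digest()
        for j in range(1, count + 1)
    ]
```

With b = 0 every device shares the same indices for a given day, so
the owner must trial-decrypt more ciphertexts; with b = 256 the indices
are specific to one ID and retrieval is cheaper but the updates are
more easily grouped by device.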

Location information can be encrypted using an 
anonymous identity-based encryption scheme [8]. Using 
a trusted key distribution center, each owner can receive 
a secret key associated to their device’s ID. (Note that 
the center will be able to decrypt updates, also.) Encryp- 
tion to the owner only requires ID. This is useful because 
then encryption does not require stored per-device state, 
under the presumption that ID is always accessible. The 
ciphertext can be submitted under the indices. The owner 
can retrieve these panic-mode updates by re-computing 
the indices using ID and the appropriate dates and then 
using trial decryption. 


8.4 Anonymous channels 


Systems such as Tor [14] implement anonymous channels,
which can be used to effectively obfuscate from recipients
the originating IP address of Internet traffic. The
Adeona design can easily compose with any such sys- 
tem by transmitting location updates to the remote stor- 
age across the anonymous channel. The combination of 
Adeona with an anonymous routing system provides sev- 
eral nice benefits. It means that the storage provider and 
outsiders do not trivially see the originating IP address, 
meaning active fingerprinting attacks are prevented. Ad- 
ditionally, it merges the anonymity set of Adeona with 
that of the anonymous channel system. For example, 
even if there exists only a single user of Adeona, that 
user might nevertheless achieve some degree of location 
privacy using anonymous channels. 

On the other hand, attempting to use anonymous chan- 
nels without Adeona does not satisfy our system goals. 
The now more complex clients would not necessarily 
be suitable for some deployment settings (e.g. hardware 
implementations). It would force a reliance on a com- 
plex, distributed infrastructure in all deployment settings. 
This reliance is particularly bad in the corporate setting. 
Routing location updates through nodes not controlled 
by the company could actually decrease corporate pri- 
vacy: outsiders could potentially learn employee loca- 
tions (e.g., see [36]). Moreover, when analyzing how to 
utilize anonymous channels and meet our tracking and 
privacy goals, it is easy to see that even with the anony- 
mous channel one still benefits from Adeona’s mecha- 
nisms. Imagine a hypothetical system based on anony- 
mous channels. Because the storage provider is poten- 
tially adversarial, the system would still need to encrypt 
location information and so also provide an index to en- 
able efficient search of the remote storage. Because the 
source IP is hidden, one might utilize a static, anony- 
mous identifier. This would allow the storage provider 
to, at the very least, link update times to a single device, 
which leaks more information than if the indices are un- 
linkable. 


8.5 Sending commands to the device 


In situations where a device is lost, an owner might wish 
to not only retrieve updates from it, but also securely 
send commands back to it. For example, such a chan- 
nel would allow remotely deleting sensitive data. We 
can securely instantiate a full duplex channel using the 
remote storage as a bulletin board. An owner could post 
encrypted and signed messages to the remote storage un- 
der indices of future updates. The client, during an up- 
date, would first do a retrieve on the indices to be used 
for the update, thereby receiving the encrypted and au- 
thenticated commands. Standard encryption and authen- 
tication tools can be used, including using cryptographic 
keys derived from the FSPRG seed in use on the client. 
In terms of location privacy, the storage provider would 
now additionally learn that two entities are communicating
via the bulletin board, but not which entities.
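
The bulletin-board idea can be sketched as follows. This is a
simplified illustration under stated assumptions: an in-memory
dictionary stands in for the remote storage, keys are derived from a
future seed with HMAC-SHA256, and commands are only authenticated, not
encrypted (a real deployment would also encrypt them, as described
above). All names are hypothetical.

```python
import hashlib
import hmac
from typing import Dict, Optional

# Stand-in for the remote storage, mapping index -> stored blob.
board: Dict[bytes, bytes] = {}


def derive(seed: bytes, label: bytes) -> bytes:
    """Derive a per-purpose key or index from a seed."""
    return hmac.new(seed, label, hashlib.sha256).digest()


def post_command(future_seed: bytes, command: bytes) -> None:
    """Owner: post an authenticated command under a future update index."""
    mac_key = derive(future_seed, b"command-mac")
    tag = hmac.new(mac_key, command, hashlib.sha256).digest()
    board[derive(future_seed, b"index")] = tag + command


def fetch_command(current_seed: bytes) -> Optional[bytes]:
    """Client: retrieve and authenticate a command before sending an update."""
    blob = board.get(derive(current_seed, b"index"))
    if blob is None or len(blob) < 32:
        return None
    tag, command = blob[:32], blob[32:]
    mac_key = derive(current_seed, b"command-mac")
    expected = hmac.new(mac_key, command, hashlib.sha256).digest()
    return command if hmac.compare_digest(tag, expected) else None
```

Because the owner can compute the device's future seeds, it can post
under an index the client has not yet reached; the client finds the
command only when its own state arrives at that seed.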


9 Conclusion 


This paper develops mechanisms by which one can build 
privacy-preserving device tracking systems. These sys- 
tems simultaneously hide a device owner’s legitimately 
visited locations from powerful adversaries, while en- 
abling tracking of the device after it goes missing. More- 
over, we do so while using third party services that are 
not trusted in terms of location privacy. Our mecha- 
nisms are efficient and practical to deploy. Our client- 
side mechanisms are well-suited for hardware implemen- 
tations. This illustrates that not only can one circumvent 
a trade-off between security and privacy, but one can do 
so in practice for real systems. 


We implemented Adeona, a full privacy-preserving 
tracking system based on OpenDHT that allows for im- 
mediate, community-orientated deployment. Its core 
module, the cryptographic engine that renders location 
updates anonymous and unlinkable, can be easily used in 
further deployment settings. To evaluate Adeona, we ran 
a field trial to gain experience with a deployment on real 
users’ systems. Our conclusion is that our approach is
sound and an immediately viable alternative to tracking
systems that offer weaker (or no) privacy guarantees. Lastly,
we also presented numerous extensions to Adeona that 
address a range of issues: disparate deployment settings, 
increased functionality, and improved security. The tech- 
niques involved, particularly our tamper-evident FSPRG, 
are likely of independent interest. 


Notes 


¹ EmailMe is a fictional system, though its functionality is based on
products such as PC Phone Home [9] and Inspice [21].

² A flea market is a type of ad-hoc market where transactions are
typically anonymous and done in cash.

³ AllDevRec is a fictional company, though the services it offers are
comparable to those advertised by Absolute Software [1], which has
tracking software pre-installed in the BIOS of some Dell laptops.

⁴ A real example of such insider abuse is found in [20].

⁵ The owner could download the entire database and do trial decryption,
but with many users this would be prohibitively expensive.

⁶ One could also utilize a keyed hash function as the basic primitive.

⁷ To be precise, OpenDHT guarantees not to expire a key-value pair
before its time-to-live passes, barring some catastrophic
failure of the DHT service [30].

⁸ Systems we built on had version 0.9.7l or later. We used SHA1
instead of the more secure SHA-256 because the latter is not implemented
in OpenSSL 0.9.7l (the most recent version available for OS X).

⁹ To be precise, the search was for any update potentially sent over
the course of 6 days and 23 hours. The final hour was omitted for simplicity
since it avoided retrieving updates being expired by OpenDHT.
Abstract


Remote error analysis aims at timely detection and rem- 
edy of software vulnerabilities through analyzing run- 
time errors that occur on the client. This objective can 
only be achieved by offering users effective protection 
of their private information and minimizing the perfor- 
mance impact of the analysis on their systems without 
undermining the amount of information the server can 
access for understanding errors. To this end, we propose 
in this paper a new technique for privacy-aware remote
analysis, called Panalyst. Panalyst includes a client com- 
ponent and a server component. Once a runtime excep- 
tion happens to an application, Panalyst client sends the 
server an initial error report that includes only public in- 
formation regarding the error, such as the length of the 
packet that triggers the exception. Using an input built 
from the report, Panalyst server performs a taint analysis 
and symbolic execution on the application, and adjusts 
the input by querying the client about the information 
upon which the execution of the application depends. 
The client agrees to answer only when the reply does 
not give away too much user information. In this way, 
an input that reproduces the error can be gradually built 
on the server under the client’s consent. Our experimen- 
tal study of this technique demonstrates that it exposes a 
very small amount of user information, introduces neg- 
ligible overheads to the client and enables the server to 
effectively analyze an error. 


1 Introduction 


Remote analysis of program runtime errors enables 
timely discovery and patching of software bugs, and has 
therefore become an important means to improve soft- 
ware security and reliability. As an example, Microsoft 
is reported to fix 29 percent of all Windows XP bugs 
within Service Pack 1 through its Windows Error Re- 
porting (WER) utility [20]. Remote error analysis is 
typically achieved by running an error reporting tool on
a client system, which gathers data related to an applica- 
tion’s runtime exception (such as a crash) and transmits 
them to a server for diagnosis of the underlying software 
flaws. This paradigm has been widely adopted by soft- 
ware manufacturers. For example, Microsoft relies on 
WER to collect data should a crash happen to an
application. Similar tools developed by third parties are also
extensively used. An example is BugToaster [27], a free 
crash analysis tool that queries a central database using 
the attributes extracted from a crash to seek a potential 
fix. These tools, once complemented by automatic anal- 
ysis mechanisms [44, 34] on the server side, will also 
contribute to quick detection and remedy of critical
security flaws that can be exploited to launch a large-scale
cyber attack such as a worm epidemic [47, 30].


The primary concern of remote error analysis is its pri- 
vacy impact. An error report may include private user 
information such as a user’s name and the data she sub- 
mitted to a website [9]. To reduce information leaks, er- 
ror reporting systems usually only collect a small amount 
of information related to an error, for example, a snippet 
of the memory around a corrupted pointer. This treat- 
ment, however, does not sufficiently address the privacy 
concern, as the snippet may still carry confidential data. 
Moreover, it can also make an error report less informa- 
tive for the purpose of rapid detection of the causal bugs, 
some of which could be security critical. To mitigate 
this problem, prior research proposes to instrument an 
application to log its runtime operations and submit the 
sanitized log once an exception happens [25, 36]. Such 
approaches affect the performance of an application even 
when it works normally, and require nontrivial changes 
to the application’s code: for example, Scrash [25] needs 
to do source-code transformation, which makes it un- 
suitable for debugging commodity software. In addition, 
these approaches still cannot ensure that sufficient infor- 
mation is gathered for a quick identification of critical 
security flaws. Alternatively, one can analyze a
vulnerable program directly on the client [29]. This involves
intensive debugging operations such as replaying the in- 
put that causes a crash and analyzing an executable at 
the instruction level [29], which could be too intrusive to 
the user’s normal operations to be acceptable for a prac- 
tical deployment. Another problem is that such an anal- 
ysis consumes a large amount of computing resources. 
For example, instruction-level tracing of program execu- 
tion usually produces an execution trace of hundreds of 
megabytes [23]. This can hardly be afforded by a client
with limited resources, such as a Pocket PC or PDA.


We believe that a good remote analyzer should help 
the user effectively control the information to be used in 
an error diagnosis, and avoid expensive operations on the 
client side and modification of an application’s source or 
binary code. On the other hand, it is also expected to 
offer sufficient information for automatic detection and 
remedy of critical security flaws. To this end, we pro- 
pose Panalyst, a new technique for privacy-aware remote 
analysis of the crashes triggered by network inputs. Pan- 
alyst aims at automatically generating a new input on the 
server side to accurately reproduce a crash that happens 
on the client, using the information disclosed according 
to the user’s privacy policies. This is achieved through 
collaboration between its client component and server 
component. When an application crashes, Panalyst client 
identifies the packet that triggers the exception and gen- 
erates an initial error report containing nothing but the 
public attributes of the packet, such as its length. Taking 
the report as a “taint” source, Panalyst server performs an 
instruction-level taint analysis of the vulnerable applica- 
tion. During this process, the server may ask the client 
questions related to the content of the packet, for exam- 
ple, whether a tainted branching condition is true. The 
client answers the questions only if the amount of infor- 
mation leaked by its answer is permitted by the privacy 
policies. The client’s answers are used by the server to 
build a new packet that causes the same exception to the 
application, and determine the property of the underlying 
bug, particularly whether it is security critical. 


Panalyst client measures the information leaks associ- 
ated with individual questions using entropy. Our pri- 
vacy policies use this measure to define the maximal 
amount of information that can be revealed for individ- 
ual fields of an application-level protocol. This treatment 
enables the user to effectively control her information 
during error reporting. Panalyst client does not perform 
any intensive debugging operations and therefore incurs 
only negligible overheads. It works on commodity appli- 
cations without modifying their code. These properties 
make a practical deployment of our technique plausible. 
In the meantime, our approach can effectively analyze a 
vulnerable application and capture the bugs that are ex- 
ploitable by malicious inputs. Panalyst can be used by 
software manufacturers to demonstrate their “due
diligence” in preserving their customers’ privacy, and by a
third party to facilitate collaborative diagnosis of vulner- 
able software. 


We sketch the contributions of this paper as follows: 


• Novel framework for remote error analysis. We propose
a new framework for remote error analysis.
The framework minimizes the impact of an analysis
on the client’s performance and resources, lets
the user maintain full control of her information,
and in the meantime provides her the convenience
to contribute to the analysis the maximal amount of
information she is willing to reveal. On the server
side, our approach interleaves the construction of
an accurate input for triggering an error, which is
achieved through interactions with the client, and
the analysis of the bug that causes the error. This
feature allows our analyzer to make full use of the
information provided by the client: even if such
information is insufficient for reproducing the error, it
helps discover part of the input attributes, which can be
fed into other debugging mechanisms such as fuzz
testing [35] to identify the bug.


• Automatic control of information leaks. We present
our design of new privacy policies to define the
maximal amount of information that can be leaked
for individual fields of an application-level protocol.
We also developed a new technique to enforce
such policies, which automatically evaluates the
information leaks caused by responding to a question
and then decides whether to submit the
answer in accordance with the policies.


• Implementation and evaluations. We implemented
a prototype system of Panalyst and evaluated it us- 
ing real applications. Our experimental study shows 
that Panalyst can accurately restore the causal input 
of an error without leaking out too much user infor- 
mation. Moreover, our technique has been demon- 
strated to introduce nothing but negligible over- 
heads to the client. 


The rest of the paper is organized as follows. Section 2 
formally models the problem of remote error analysis. 
Section 3 elaborates the design of Panalyst. Section 4 
describes the implementation of our prototype system. 
Section 5 reports an empirical study of our technique us- 
ing the prototype. Section 6 discusses the limitations of 
our current design. Section 7 presents the related prior 
research, and Section 8 concludes the paper and envi- 
sions the future research. 
2 Problem Description


We formally model the problem of remote error analysis
as follows. Let $P : S \times I \to S$ be a program that maps
an initial process state $s \in S$ and an input $i \in I$ to an end
state. A state here describes the data in memory, disk and
register that are related to the process of $P$. A subset of
$S$, $E_b$, contains all possible states the process can end at
after an input exploits a bug $b$.

Once $P$ terminates at an error state, the client runs
an error reporting program $G : I \to R$ to generate a
report $r \in R$ for analyzing $P$ on the server. The report
must be created under the constraints of the computing
resources the client is able or willing to commit.
Specifically, $C_t : \{G\} \times I \times R \to \mathbb{R}$ measures the
delay experienced by the user during report generation,
$C_s : \{G\} \times I \times R \to \mathbb{R}$ measures the storage overhead,
and $C_w : \{G\} \times I \times R \to \mathbb{R}$ measures the bandwidth
used for transmitting the report. To produce and submit a
report $r \in R$, the computation time, storage consumption
and bandwidth usage must be bounded by certain thresholds:
formally, $(C_t(G,i,r) \le Th_t) \land (C_s(G,i,r) \le Th_s) \land (C_w(G,i,r) \le Th_w)$,
where $Th_t$, $Th_s$ and $Th_w$
represent the thresholds for time, storage space and bandwidth
respectively. In addition, $r$ is allowed to be submitted
only when the amount of information it carries is
acceptable to the user. This is enforced using a function
$L : R \times I \to \mathbb{R}$ that quantifies the information leaked
out by $r$, and a threshold $Th_l$. Formally, we require
$L(r, i) \le Th_l$.

The server runs an analyzer $D : R \to I$ to diagnose
the vulnerable program $P$. $D$ constructs a new input using
$r$ to exploit the same bug that causes the error on
the client. Formally, given $P(i) \in E_b$ and $r = G(i)$,
the analyzer identifies another input $i'$ from $r$ such that
$P(i') \in E_b$. This is also subject to resource constraints.
Specifically, let $C'_t : \{D\} \times R \times I \to \mathbb{R}$ be a function that
measures the computation time for running $D$ and
$C'_s : \{D\} \times R \times I \to \mathbb{R}$ a function that measures the storage overhead.
We have: $(C'_t(D,r,i') \le Th'_t) \land (C'_s(D,r,i') \le Th'_s)$,
where $Th'_t$ and $Th'_s$ are the server’s thresholds for time
and space respectively.
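
The client-side constraints above can be read as a single admissibility check on a candidate report. The following is a minimal sketch in Python, not part of Panalyst: the names `ClientCosts`, `ClientThresholds` and `report_admissible` are hypothetical, the cost values are assumed to have been measured already, and thresholds are treated as inclusive bounds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientCosts:
    time: float       # C_t(G, i, r): delay during report generation
    storage: float    # C_s(G, i, r): storage overhead
    bandwidth: float  # C_w(G, i, r): bandwidth used to send the report
    leak: float       # L(r, i): information leaked by the report


@dataclass(frozen=True)
class ClientThresholds:
    time: float       # Th_t
    storage: float    # Th_s
    bandwidth: float  # Th_w
    leak: float       # Th_l


def report_admissible(c: ClientCosts, th: ClientThresholds) -> bool:
    """Return True when every client-side cost stays within its threshold."""
    return (c.time <= th.time
            and c.storage <= th.storage
            and c.bandwidth <= th.bandwidth
            and c.leak <= th.leak)
```

The server-side constraints on $D$ have the same shape, with only time and storage thresholds.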

A solution to the above problem is expected to achieve 
three objectives: 


• Low client overheads. A practical solution should
work effectively under very small $Th_t$, $Th_s$ and
$Th_w$. Remote error analysis aims at timely detection
of critical security flaws, which can only be
achieved when most clients are willing to collaborate
most of the time. However, this will not happen
unless the client-side operations are extremely
lightweight, as clients may have limited resources
and their workloads may vary with time. Actually,
customers could be very sensitive to the overheads
brought in by error reporting systems. For example,
advice has been given to turn off WER on Windows
Vista and Windows Mobile to improve their performance
[12, 17, 13]. Therefore, it is imaginable that
many may stop participating in error analysis in response
to even a slight increase of overheads. As
a result, the chance to catch dangerous bugs can be
significantly reduced.


• Control of information leaks. The user needs to
have full control of her information during an error
analysis. Otherwise, she may choose not to participate.
Indispensable to this objective is a well-constructed
function $L$ that offers the user a reasonable
measure of the information within an error report.
In addition, privacy policies built upon $L$ and
a well-designed policy enforcer will automate the
information control, thereby offering the user a reliable
and convenient way to protect her privacy.


• Usability of error report. Error reports submitted
by the user should contain ample information to allow
a new input $i'$ to be generated within a short
period of time (small $Th'_t$) and at a reasonable storage
overhead (small $Th'_s$). The reports produced
by the existing systems include little information,
for example, a snapshot of the memory around a
corrupted pointer. As a result, an analyzer may
need to exhaustively explore a vulnerable program’s
branches to identify the bug that causes the error.
This process can be very time-consuming. To improve
this situation, it is important to have a report
that gives a detailed description about how an exploit
happens.


In Section 3, we present an approach that achieves 
these objectives. 


3 Our Approach 


In this section, we first present an overview of Panalyst 
and then elaborate on the designs of its individual com- 
ponents. 


3.1 Overview 


Panalyst has two components, client and server. Panalyst 
client logs the packets an application receives, notifies 
the server of its runtime error, and helps the server ana- 
lyze the error by responding to its questions as long as 
the answers are permitted by the user’s privacy policies. 
Panalyst server runs an instruction-level taint analysis on 
the application’s executable using an empty input, and 
evaluates the execution symbolically [37] in the mean- 
time. Whenever the server encounters a tainted value that 
affects the choice of execution paths or memory access, 
Figure 1: The Design of Panalyst.


it queries the client using the symbolic expression of that 
value. From the client’s answer, the server uses a con- 
straint solver to compute the values of the input bytes 
that taint the expression. We illustrate the design of our 
approach in Figure 1. 


1  if (strcmp(conn->requestMethod, "POST") == 0) {
2      buf = malloc(conn->ContentLength + 1024);
3      for (len = 0;;) {
4          recv(sd, buf + len, 1, 0);
5          len++;
6          if (buf[len-1] == '\n') break;
7      }
8  }


Figure 2: An Illustrative Example. 


An example. Here we explain how Panalyst works 
through an example, a program described in Figure 2. 
The example is a simplified version of Null-HTTPd [8]. 
It is written in C for illustration purposes: Panalyst is
actually designed to work on binary executables. The
program first checks whether a packet is an HTTP POST 
request. If so, it allocates a buffer with the size com- 
puted by adding 1024 to an integer derived from the 
Content-Length field and moves the content of the 
request to that buffer. A problem here is that a buffer 
overflow can happen if Content-Length is set to be 
negative, which makes the buffer smaller than expected. 
When this happens, the program may crash as a result of 
writing to an illegal address or being terminated by an er- 
ror detection mechanism such as GLIBC error detection. 

Panalyst client logs the packets recently received by 
the program. In response to a crash, the client identifies
the packet being processed and notifies Panalyst
server of the error. The server then starts analyzing the 
vulnerable program at instruction level using an empty 
HTTP request as a taint source. The request is also de- 
scribed by a set of symbols, which the server uses to 
compute a symbolic expression for the value of every 
tainted memory location or register. When the execution 
of the program reaches Line 1 in Figure 2, the values
of the first four bytes on the request need to be revealed
so as to determine the branch the execution should follow.
For this purpose, the server sends the client a question:
“$B_1B_2B_3B_4$ = ‘POST’?”, where $B_j$ represents the
$j$th byte on the request. The client checks its privacy
policies, which define the maximal number of bits of
information allowed to be leaked for individual HTTP
fields. In this case, the client is permitted to reveal the
keyword POST that is deemed nonsensitive. The server
then fills the empty request with these letters and moves 
on to the branch determined by the client’s answer. The 
instruction on Line 2 calls malloc. The function ac- 
cesses memory using a pointer built upon the content of 
Content-Length, which is tainted. To enable this 
memory access, the server sends the symbolic expression 
of the pointer to the client to query its concrete value. 
The client’s reply allows the server to add more bytes to 
the request it is working on. Finally, the execution hits 
Line 3, a loop to move request content to the buffer al- 
located through malloc. The loop is identified by the 
server from its repeated instruction pattern. Then, a question
is delivered to the client to query its exit condition:
“where is the first byte $B_j$ = ‘\n’?”. This question concerns
request content, a field on which the privacy policies
forbid the client to leak out more than a certain amount
of information. Suppose that threshold is 5 bytes. To answer
the question, only one byte needs to be given away:
the position of the byte ‘\n’. Therefore, the client answers
the question, which enables the server to construct
a new packet to reproduce the crash.

The performance of an analysis can be improved by 
sending the server an initial report with all the fields 
that are deemed nonsensitive according to the user’s privacy
policies. In the example, these fields include keywords 
such as ‘POST’ and the Content-Length field. This 
treatment reduces the communication overheads during 
an analysis. 
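
To make the walk-through concrete, the sketch below shows one way a server could assemble a request from the answers it has collected. It is an illustration in Python under stated assumptions, not the Panalyst implementation: `rebuild_request` is a hypothetical helper, byte positions are 0-based, and every byte the client did not reveal is replaced with an arbitrary filler that must differ from the newline so the loop exits at the same position.

```python
from typing import Mapping


def rebuild_request(total_len: int, known: Mapping[int, int],
                    filler: int = ord("A")) -> bytes:
    """Build a message of total_len bytes from the revealed byte values.

    `known` maps a 0-based position to the byte value the client disclosed;
    all other positions receive the filler byte.
    """
    if filler == ord("\n"):
        raise ValueError("filler must not be the loop's exit byte")
    msg = bytearray([filler]) * total_len
    for pos, value in known.items():
        msg[pos] = value
    return bytes(msg)


# Answers from the example: the keyword and the position of the first '\n'.
known = {i: b for i, b in enumerate(b"POST")}
known[40] = ord("\n")          # position 40 is an illustrative value
request = rebuild_request(64, known)
```

The Content-Length digits revealed by the pointer query would be filled in the same way; they are left out here to keep the sketch short.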


Threat model. We assume that the user trusts the in- 
formation provided by the server but does not trust her 
data with the server. The rationale behind this assump- 
tion is based upon the following observations. The own- 
ers of the server are often software manufacturers, who 
have little incentive to steal their customers’ information. 
What the user does not trust is the way in which those 
parties manage her data, as improper management of the 
data can result in leaks of her private information. Ac- 
tually, the same issue is also of concern to those owners, 
as they could be reluctant to take the liability for protect- 
ing user information. Therefore, the client can view the 
server as a benign though unreliable partner, and take ad- 
vantage of the information it discovers from the vulner- 
able program to identify sensitive data, which we elabo- 
rate in Section 3.2. 

Note that this assumption is not fundamental to Pana- 
lyst: more often than not, the client is capable of identi- 
fying sensitive data on its own. As an example, the afore- 
mentioned analysis on the program in Figure 2 does not 
rely on any trust in the server. Actually, the assumption
only serves as an approach for defining fine-grained privacy
policies in our research (Section 3.2), and elimination of
the assumption, though it may lead to coarser-grained
policies under some circumstances, will not invalidate the
whole approach.


3.2 Panalyst Client 


Panalyst client is designed to work on the computing 
devices with various resource constraints. Therefore, it 
needs to be extremely lightweight. The client also in- 
cludes a set of policies for protecting the user’s privacy 
and a mechanism to enforce them. We elaborate such a 
design as follows. 


Packet logging and error reporting. Panalyst client in- 
tercepts the packets received by an application, extracts 
their application-level payloads and saves them to a log 
file. This can be achieved either through capturing pack- 
ets at network layer using a sniffer such as Wireshark [1], 
or by interposing on the calls for receiving data from the
network. We chose the latter for prototyping the client: in
our implementation, an application’s socket calls are
intercepted using ptrace [10] to dump the application-level
data to a log. The size of the file is bounded, and
therefore only the most recent packets are kept.
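
A size-bounded log that keeps only the most recent payloads can be sketched as follows. This is an assumption-laden illustration in Python rather than the prototype's code: the class name `BoundedPayloadLog` and the in-memory deque are hypothetical choices (the prototype writes to a file), and the eviction rule of dropping the oldest payloads first is one plausible reading of "only the most recent packets are kept".

```python
from collections import deque


class BoundedPayloadLog:
    """Keep the most recent application-level payloads within a byte budget."""

    def __init__(self, max_bytes: int) -> None:
        self.max_bytes = max_bytes
        self.size = 0
        self.entries = deque()

    def append(self, payload: bytes) -> None:
        self.entries.append(payload)
        self.size += len(payload)
        # Evict the oldest payloads until the budget holds again; the newest
        # payload is always kept, even if it alone exceeds the budget.
        while self.size > self.max_bytes and len(self.entries) > 1:
            self.size -= len(self.entries.popleft())
```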


When a serious runtime error happens, the process of 
a vulnerable program may crash, which triggers our error 
analysis mechanism. Runtime errors can also be detected 
by mechanisms such as GLIBC error detection, Windows
built-in diagnostics [11] or other runtime error detection
techniques [28, 21]. Once an error happens to
an application, Panalyst client identifies the packets it is 
working on. This is achieved in our design by looking at 
all the packets within one TCP connection. Specifically, 
the client marks the beginning of a connection once ob- 
serving an accept call from the application and the end 
of the connection when it detects close. After an ex- 
ception happens, the client concatenates the application- 
level payloads of all the packets within the current con- 
nection to form a message, which it uses to talk to the 
server. For simplicity, our current design focuses on the 
error triggered by network input and assumes that all 
information related to the exploit is present in a single 
connection. Panalyst can be extended to handle the er- 
rors caused by other inputs such as data from a local 
file through logging and analyzing these inputs. It could 
also work on multiple connections with the support of 
the state-of-the-art replay techniques [43, 32] that are capable
of replaying the whole application-layer session to
the vulnerable application on the server side. When a
runtime error occurs, Panalyst client notifies the server
of the type of the error, for example, segmentation fault
or illegal instruction. Moreover, the client can ship to
the server part of the message responsible for the error,
provided that such information is deemed nonsensitive according
to the user’s privacy policies.
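
The connection bookkeeping described above (mark the start at accept, the end at close, and concatenate the payloads seen in between) might look like the sketch below. It is written in Python for illustration only; `ConnectionTracker` and its method names are hypothetical, and connections are keyed by a socket descriptor as a simplifying assumption.

```python
class ConnectionTracker:
    """Group logged payloads by connection, from accept until close."""

    def __init__(self) -> None:
        self.open = {}  # descriptor -> list of payloads received so far

    def on_accept(self, fd: int) -> None:
        self.open[fd] = []

    def on_recv(self, fd: int, payload: bytes) -> None:
        self.open.setdefault(fd, []).append(payload)

    def on_close(self, fd: int) -> None:
        self.open.pop(fd, None)

    def causal_message(self, fd: int) -> bytes:
        """Concatenate the payloads of the connection being processed."""
        return b"".join(self.open.get(fd, []))
```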


After reporting to the server a runtime error, Panalyst 
client starts listening to a port to wait for the questions 
from the server. Panalyst server may ask two types of 
questions, either related to a tainted branching condi- 
tion or a tainted pointer a vulnerable program uses to 
access memory. In the first case, the client is supposed
to answer “yes” or “no” to the question described by a
symbolic inequality: $C(B_{k[1]}, \ldots, B_{k[m]}) < 0$, where
$B_{k[j]}$ ($1 \le j \le m$) is the symbol for the $k[j]$th byte
on the causal message. In the second case, the client is
queried about the concrete value of a symbolic pointer
$S(B_{k[1]}, \ldots, B_{k[m]})$. These questions can be easily
addressed by the client using the values of these bytes on
the message. However, the answers can be delivered to 
the server only after they are checked against the user’s 
privacy policies, which we describe below. 
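
Evaluating either kind of question on the client amounts to substituting concrete byte values into the expression. The sketch below illustrates this in Python under stated assumptions: the function names are hypothetical, positions are 0-based, the symbolic expressions are modeled as plain Python callables, and the policy check that must precede sending an answer is deliberately left out.

```python
from typing import Callable, Sequence


def answer_branch(message: bytes, positions: Sequence[int],
                  cond: Callable[..., int]) -> bool:
    """Answer 'is C(B_k[1], ..., B_k[m]) < 0?' using the concrete bytes."""
    return cond(*(message[k] for k in positions)) < 0


def answer_pointer(message: bytes, positions: Sequence[int],
                   expr: Callable[..., int]) -> int:
    """Return the concrete value of a symbolic pointer S(B_k[1], ..., B_k[m])."""
    return expr(*(message[k] for k in positions))


# Illustrative question: "B_j * B_k < 256?" rewritten as C = B_j * B_k - 256.
msg = bytes([3, 10, 200, 200])
small = answer_branch(msg, [0, 1], lambda a, b: a * b - 256)
large = answer_branch(msg, [2, 3], lambda a, b: a * b - 256)
# Illustrative pointer: a hypothetical base address plus one tainted byte.
ptr = answer_pointer(msg, [1], lambda b: 0x1000 + b)
```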
Privacy policies. Privacy policies here are designed to
specify the maximal amount of information that can be 
given away during an error analysis. Therefore, they 
must be built upon a proper measure of information. 
Here, we adopt entropy [48], a classic concept of information
theory, as the measure. Entropy quantifies uncertainty
as a number of bits. Specifically, suppose that an
application field $A$ is equally likely to take one of $m$
different values. The entropy of $A$ is computed as $\log_2 m$
bits. If the client reveals that $A$ makes a path condition
true, which reduces the possible values the field can have
to a proportion $p$ of $m$, the exposed information is
quantified as: $\log_2 m - \log_2 pm = -\log_2 p$ bits.

The privacy policies used in Panalyst define the max- 
imal number of bytes of the information within a pro- 
tocol field that can be leaked out. The number here is 
called the leakage threshold. Formally, denote the leakage
threshold for a field $A$ by $\tau$. Suppose the server can infer
from the client’s answers that $A$ can take a proportion
$p$ of all possible values of that field. The privacy policy
requires that the following hold: $-\log_2 p \le \tau$. For
example, a policy can specify that no more than 2 bytes 
of the URL information within an HTTP request can be 
revealed to the server. This policy design can achieve 
a fine-grained control of information. As an example, 
let us consider HTTP requests: protocol keywords such 
as GET and POST are usually deemed nonsensitive, and 
therefore can be directly revealed to the server; on the 
other hand, the URL field and the cookie field can be 
sensitive, and need to be protected by low leakage thresh- 
olds. Panalyst client includes a protocol parser to parti- 
tion a protocol message into fields. The parser does not 
need to be precise: if it cannot tell two fields apart, it just 
treats them as a single field. 
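
The leakage measure and the threshold test can be written down directly. The sketch below is a Python illustration, not the authors' enforcer: `leaked_bits` and `within_policy` are hypothetical names, the threshold is given in bytes and converted at 8 bits per byte (an assumption, since the text states thresholds in bytes and entropy in bits), and the bound is treated as inclusive to match "no more than".

```python
import math


def leaked_bits(p: float) -> float:
    """Information exposed when the field is narrowed to a proportion p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log2(p)


def within_policy(p: float, threshold_bytes: float) -> bool:
    """Check -log2(p) against a leakage threshold expressed in bytes."""
    return leaked_bits(p) <= 8 * threshold_bytes


# Revealing that four bytes equal 'POST' leaves 1 of 256**4 values: 32 bits.
post_leak = leaked_bits(2.0 ** -32)
```

Under these assumptions, revealing a four-byte keyword passes a 5-byte threshold but not a 2-byte one.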





A problem here is that applications may use closed 
protocols such as ICQ and SMB whose specifications are 
not publicly available. For these protocols, the whole
protocol message has to be treated as a single field, which 
unfortunately greatly reduces the granularity of control 
privacy policies can have. A solution to this problem is to 
partition information using the parameters of API (such 
as Linux kernel API, GLIBC or Windows API) functions 
that work on network input. For example, suppose that 
the GLIBC function fopen builds its parameters upon 
an input message; we can infer that the part of the mes- 
sage related to file access modes (such as ‘read’ and 
‘write’) can be less sensitive than that concerning file 
name. This approach needs a model of API functions and 
trust in the information provided by the server. Another 
solution is to partition an input stream using a set of to- 
kens and common delimiters such as ‘\n’. Such tokens 
can be specified by the user. For example, using the token
‘secret’ and the delimiter ‘.’, we can divide the
URL ‘www.secretservice.gov’ into the following
fields: ‘www’, ‘.’, ‘secretservice’ and ‘gov’.
Upon these fields, different leakage thresholds can be de- 
fined. These two approaches can work together and also 
be applied to specify finer-grained policies within a pro- 
tocol field when the protocol is public. 
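
Token-and-delimiter partitioning can be sketched as below. This is one possible reading in Python, not the authors' parser: `partition` is a hypothetical helper, delimiters are kept as separate fields, at least one delimiter is assumed, and a field is flagged sensitive when it contains any user-specified token. Note that the delimiter appears once per occurrence in the output, whereas the text lists it once.

```python
import re
from typing import List, Sequence, Tuple


def partition(stream: str, tokens: Sequence[str],
              delimiters: Sequence[str]) -> List[Tuple[str, bool]]:
    """Split on delimiters (kept as fields) and flag token-bearing fields."""
    pattern = "(" + "|".join(re.escape(d) for d in delimiters) + ")"
    pieces = [piece for piece in re.split(pattern, stream) if piece]
    return [(piece, any(tok in piece for tok in tokens)) for piece in pieces]


fields = partition("www.secretservice.gov", tokens=["secret"], delimiters=["."])
```

Different leakage thresholds can then be attached to the flagged and unflagged fields.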


To facilitate specification of the privacy policies, Panalyst
can provide the user with policy templates set by
an expert. Such an expert can be any party who has the
knowledge about fields and the amount of information
that can be disclosed without endangering the content of
a field. For example, people knowledgeable about the
HTTP specifications are in a position to label fields
like ‘www’ as nonsensitive and domain names such as
‘secretservice.gov’ as sensitive. Typically, protocol
keywords, delimiters and some API parameters can
be treated as public information, while the fields such 
as those including the tokens and other API parameters 
are deemed sensitive. A default leakage threshold for 
a sensitive field can be just a few bytes: for example, 
we can allow one or two bytes to be disclosed from a 
domain-name field, because they are too general to be 
used to pinpoint the domain name; as another example, 
up to four bytes can be exposed from a field that may 
involve credit-card numbers, because people usually tol- 
erate such information leaks in real life. Note that we 
may not be able to assign a zero threshold to a sensitive 
field because this can easily cause an analysis to fail: to 
proceed with an analysis, the server often needs to know 
whether the field contains some special byte such as a 
delimiter, which gives away a small amount of informa- 
tion regarding its content. These policy templates can be 
adjusted by a user to define her customized policies. 


Policy enforcement. To enforce privacy policies, we 
need to quantify the information leaked by the client’s 
answers. This is straightforward in some cases but less 
so in others. For example, we know that answering ‘yes’ 
to the question “$B_1B_2B_3B_4$ = ‘POST’?” in Figure 2
gives away four bytes; however, information leaks can
be more difficult to gauge when it comes to
questions like “$B_j \times B_k < 256$?”, where $B_j$ and $B_k$
indicate the $j$th and the $k$th bytes on a message
respectively. Without loss of generality, let us consider a
set of bytes $(B_{k[1]}, \ldots, B_{k[m]})$ of a protocol message,
whose concrete values on the message make a condition
“$C(B_{k[1]}, \ldots, B_{k[m]}) < 0$” true. To quantify the
information an answer to the question gives away, we 
need to know p, the proportion of all possible values 
these bytes can take that make the condition true. Find- 
ing p is nontrivial because the set of the values these 
bytes can have can be very large, which makes it im- 
practical to check them one by one against the inequal- 
ity. Our solution to the problem is based upon the classic 
statistic technique for estimating a proportion in a popu- 
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lation. Specifically, we randomly pick up a set of values 
for these bytes to verify a branching condition and re- 
peat the trial for n times. From these n trials, we can 
estimate the proportion p as > where x is the number 
of trials in which the condition is true. The accuracy of 
this estimate is described by the probability that a range 
of values contain the true value of p. The range here 
is called confidence interval and the probability called 
confidence level. Given a confidence interval and a con- 
fidence level, standard statistic technique can be used to 
determine the size of samples n [2]. For example, sup- 
pose the estimate of p is 0.3 with a confidence inter- 
val +0.5 and a confidence level 0.95, which intuitively 
means 0.25 < p < 0.35 with a probability 0.95; in 
this case, the number of trials we need to play is 323. 
This approach offers an approximation of information 
leaks: in the prior example, we know that with 0.95 con- 
fidence, information being leaked will be no more than 
— log, 0.25 = 4 bits. Using such an estimate and a pre- 
determined leakage threshold, a policy enforcer can de- 
cide whether to let the client answer a question. 
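The estimation procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Panalyst implementation: the function names are hypothetical, the normal-approximation sample-size formula $n = z^2 p(1-p)/E^2$ is the textbook one (the paper only cites a standard technique [2]), and byte values are assumed to be uniformly distributed.

```python
import math
import random

Z_95 = 1.96  # two-sided normal quantile for a 0.95 confidence level


def sample_size(p_hat, margin, z=Z_95):
    """Trials needed so that p_hat +/- margin holds at the chosen confidence level."""
    return math.ceil(z * z * p_hat * (1.0 - p_hat) / (margin * margin))


def estimate_proportion(condition, num_bytes, trials, rng):
    """Estimate p = x/n by testing `condition` on uniformly random byte tuples."""
    hits = 0
    for _ in range(trials):
        values = [rng.randrange(256) for _ in range(num_bytes)]
        if condition(*values):
            hits += 1
    return hits / trials


def leak_upper_bound_bits(p_hat, margin):
    """Bound the bits revealed by a 'yes' using the low end of the interval."""
    lower = p_hat - margin
    if lower <= 0.0:
        return math.inf  # the interval cannot exclude an arbitrarily rare event
    return -math.log2(lower)


if __name__ == "__main__":
    rng = random.Random(2008)  # fixed seed so the run is repeatable
    p_hat = estimate_proportion(lambda bj, bk: bj * bk < 256, 2, 400, rng)
    print("estimated p:", p_hat)
    print("trials for p=0.3, +/-0.05:", sample_size(0.3, 0.05))
    print("leak bound (bits):", leak_upper_bound_bits(0.3, 0.05))
```

A policy enforcer in this sketch would compare the returned bound with the per-field threshold and refuse to answer when the bound exceeds it.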





3.3 Panalyst Server


Panalyst server starts working on a vulnerable applica- 
tion upon receiving an initial error report from the client. 
The report includes the type of the error, and other non- 
sensitive information such as the corrupted pointer, the 
lengths of individual packets’ application-level payloads 
and the content of public fields. Based upon it, the server 
conducts an instruction-level analysis of the application’s 
executable, which we elaborate as follows. 


Taint analysis and symbolic execution. Panalyst server 
performs a dynamic taint analysis on the vulnerable pro- 
gram, using a network input built upon the initial re- 
port as a taint source. The input involves a set of pack- 
ets, whose application-layer payloads form a message 
characterized by the same length as the client’s message 
and the information disclosed by the report. The server 
monitors the execution of the program instruction by in- 
struction to track tainted data according to a set of taint- 
propagation rules. These rules are similar to those used 
in other taint-analysis techniques such as RIFLE [51], 
TaintCheck [44] and LIFT [45], examples of which are 
presented in Table 1. Along with the dynamic analysis, 
the server also performs a symbolic execution [37] that 
statically evaluates the execution of the program through 
interpreting its instructions, using symbols instead of real 
values as input. Each symbol used by Panalyst represents 
one byte on the input message. Analyzing the program 
in this way, we can not only keep close track of tainted 
data flows, but also formulate a symbolic expression for 
every tainted value in memory and registers. 
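To make the combination of taint tracking and symbolic expressions concrete, the following minimal sketch keeps a shadow state for registers. It is an illustration only: the class and method names are hypothetical, expressions are plain strings, and only `mov` and `add` are modeled, whereas the actual analyzer works on x86 instructions under Pin.

```python
class ShadowState:
    """Concrete register values plus symbolic expressions for tainted ones."""

    def __init__(self):
        self.val = {}  # register -> concrete 32-bit value
        self.sym = {}  # register -> expression string; absent means untainted

    def taint(self, reg, symbol, value=0):
        """Mark `reg` as holding input symbol `symbol` with a hypothetic value."""
        self.val[reg] = value
        self.sym[reg] = symbol

    def _expr(self, reg):
        # Untainted operands enter expressions as their concrete value.
        return self.sym.get(reg, str(self.val.get(reg, 0)))

    def mov(self, dst, src):
        """Data movement: taint follows the source operand."""
        self.val[dst] = self.val.get(src, 0)
        if src in self.sym:
            self.sym[dst] = self.sym[src]
        else:
            self.sym.pop(dst, None)  # untainted source clears the destination

    def add(self, dst, src):
        """Arithmetic: destination and EFLAGS become tainted if any operand is."""
        tainted = dst in self.sym or src in self.sym
        expr = f"({self._expr(dst)} + {self._expr(src)})"
        self.val[dst] = (self.val.get(dst, 0) + self.val.get(src, 0)) & 0xFFFFFFFF
        if tainted:
            self.sym[dst] = expr
            self.sym["eflags"] = f"flags{expr}"  # flags depend on the result
        else:
            self.sym.pop("eflags", None)
```

When a new hypothetic value is chosen for a symbol, a real implementation would re-evaluate every stored expression that mentions it; that refresh step is omitted here.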

Whenever the execution encounters a conditional
branch whose condition is tainted by input symbols,
the server sends the condition as a question to the client
and waits for the answer. With the answer from the client, the
server can find hypothetic values for these symbols using
a constraint solver. For example, a “no” to the question
$B_i$ = ‘\n’ may result in the letter ‘a’ being assigned to the
$i$th byte of the input. To keep the runtime data consistent
with the hypothetic value of symbol $B_i$, the server
updates all the tainted values related to $B_i$ by evaluating
their symbolic expressions with the hypothetic value.
It is important to note that $B_i$ may appear in multiple
branching conditions $(C_1 < 0, \ldots, C_k < 0)$. Without
loss of generality, suppose all of them are true. To find
a value for $B_i$, the constraint solver must solve the
constraint $(C_1 < 0) \wedge \ldots \wedge (C_k < 0)$. The server also needs
to “refresh” the tainted values concerning $B_i$ each time
a new hypothetic value of the symbol comes up.


The server also queries the client when the program
attempts to access memory through a pointer tainted by
input symbols $(B_{k[1]}, \ldots, B_{k[m]})$. In this case, the server
needs to give the symbolic expression of the pointer,
$S(B_{k[1]}, \ldots, B_{k[m]})$, to the client to get its value $v$, and
solve the constraint $S(B_{k[1]}, \ldots, B_{k[m]}) = v$ to find
these symbols’ hypothetic values. Querying a tainted
pointer is necessary for ensuring the program’s correct
execution, particularly when a write happens through
such a pointer. It is also an important step for reliably
reproducing a runtime error, as the server may need to
know the value of a pointer, or at least its range, to
determine whether an illegal memory access is about to occur.
However, this treatment may disclose too much user
information, in particular when the pointer involves only
one symbol: a “yes” to such a question often exposes
the real value of that symbol. Such a problem usually
happens in a string-related GLIBC function, where
letters of a string are used as offsets to look up a table.
Our solution is to accommodate symbolic pointers in our
analysis if such a pointer carries only one symbol and is
used to read from a memory location. This approach can
be explained through an example. Consider the
instruction “MOV EAX, [ESI+CL]”, where CL is tainted by
an input byte $B_j$. Instead of directly asking the client
for the value of ESI+CL, which reveals the real value of
$B_j$, the server gathers the bytes from the memory
locations pointed to by (ESI+0, ESI+1, ..., ESI+255) to
form a list. The list is used to prepare a question should
EAX get involved in a branching condition such as “CMP
EAX, 1”. In this case, the server generates a query
including [ESI+CL], which is the symbolic expression
of EAX, the value of ESI, the list and the condition. In
response to the query, the client uses the real value of
$B_j$ and the list to verify the condition and answers either
“yes” or “no”, which enables the server to identify the
right branch.
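The table-lookup query for a single-symbol read pointer can be sketched as follows. This is a simplified illustration with hypothetical names: memory is modeled as a dictionary from address to byte, each lookup yields one byte rather than the four bytes a real `MOV EAX, [ESI+CL]` loads, and the leak estimate assumes the input byte is uniform over 0..255.

```python
import math


def build_lookup_query(memory, esi, compare_to):
    """Server side: collect the 256 bytes reachable as [ESI + B] for B in 0..255."""
    table = [memory.get(esi + offset, 0) for offset in range(256)]
    return {"table": table, "compare_to": compare_to}


def client_answer(query, real_byte):
    """Client side: evaluate the condition with the real byte, reveal only yes/no."""
    return query["table"][real_byte] == query["compare_to"]


def answer_leak_bits(query, answer):
    """Bits revealed by `answer`, assuming the byte is uniform over 0..255.

    `answer` must come from client_answer, so at least one byte value matches.
    """
    matches = sum(1 for v in query["table"] if (v == query["compare_to"]) == answer)
    return -math.log2(matches / 256)
```

In this sketch the server learns only which branch to take; the client can additionally use `answer_leak_bits` to charge the answer against its leakage threshold before replying.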
Table 1: Examples of the Taint Rules.

| Instruction Category | Taint Propagation | Examples |
|---|---|---|
| data movement | (1) taint is propagated to the destination if the source is tainted; (2) the destination operand is not tainted if the source operand is not tainted. | mov eax, ebx; push eax; call 0x4080022; lea ebx, ptr [ecx+10] |
| arithmetic | (1) taint is propagated to the destination if the source is tainted; (2) the EFLAGS is also regarded as a destination operand. | and eax, ebx; inc ecx; shr eax,0x8 |
| address calculation | an address is tainted if any element in the address calculation is tainted | mov ebx, dword ptr [ecx+2*ebx+0x08] |
| conditional jump | regard EFLAGS as a source operand | jz 0x0746323; jnle 0x878342; jg 0x405687 |
| compare | regard EFLAGS as a destination operand | cmp eax,ebx; test eax,eax |











The analysis stops when the execution reaches a state 
where a runtime error is about to happen. Examples 
of such a state include a jump to an address outside 
the process image or an illegal instruction, and mem- 
ory access through an illegal pointer. When this hap- 
pens, Panalyst server announces that an input reproduc- 
ing the error has been identified, and can be used for 
further analysis of the underlying bug and generation of 
signatures [52, 50, 39] or patches [49]. Our analysis also 
contributes to a preliminary classification of bugs: if the 
illegal address that causes the error is found to be tainted, 
we have a reason to believe that the underlying bug can 
be exploited remotely and therefore is security critical. 


Reducing communication overhead. A major concern 
for Panalyst seems to be communication overhead: the 
server may need to query the client whenever a tainted 
branching condition or a tainted pointer is encountered. 
However, in our research, we found that the bandwidth 
consumed in an analysis usually is quite small, less 
than a hundred KB during the whole analysis. This is 
because the number of tainted conditions and pointers 
can be relatively small in many programs, and both the 
server’s questions and the client’s answers are usually 
short. Need for communication can be further reduced 
if an initial error report supplies the server with a suffi- 
cient amount of public information regarding the error. 
However, the performance of the server and the client
will still be affected when the program operates intensively
on tainted data, which in many cases happens inside
loops.


A typical loop that appears in many network-facing
applications is similar to the one in the example (Line
6 of Figure 2). The loop compares individual bytes in
a protocol field with a delimiter such as ‘\n’ or ‘ ’ to
identify the end of the field. If we simply view the loop as
a sequence of conditional branches, then the server has
to query the client for every byte within that field, which
can be time consuming. To mitigate this problem, we
designed a technique that first identifies such
a loop and then lets the client proactively scan its message
to find the location of the first string that terminates the
loop. We describe the technique below.


The server monitors each tainted conditional branch
that the execution repeatedly runs into. When
the number of such encounters exceeds a threshold, we
believe that a loop has been identified. The step value
of that loop can be approximated by the difference
between the indices of the symbols that appear in two
consecutive evaluations of the condition. For example,
consider the loop in Figure 2. If the first time the
execution compares $B_i$ with ‘\n’ and the second time it tries
$B_{i+1}$, we estimate the step as one. The server then sends
a question to the client, including the loop condition
$C(B_{k[1]}, \ldots, B_{k[m]})$ and step estimates $\lambda_{k[1]}, \ldots, \lambda_{k[m]}$.
The client starts from the $k[i]$th byte $(1 \le i \le m)$ and scans
its message every $\lambda_{k[i]}$ bytes, until it finds a set of bytes
$(B_{k'[1]}, \ldots, B_{k'[m]})$ that makes the condition false. The
positions of these bytes are shipped to the server. As a
result, the analysis can evaluate the loop condition using
such information, without talking to the client iteration
by iteration.
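The client-side scan can be sketched as below. This is a minimal illustration under assumptions: the function name and argument layout are hypothetical, the loop condition is passed as a Python callable over byte values, and positions are zero-based indices into the message.

```python
def scan_for_exit(message, condition, starts, steps):
    """Return the positions of the first byte tuple that makes `condition` false.

    `starts` holds one start index per symbol in the condition and `steps` the
    estimated step for each. Returns None when the scan runs off the message
    without ever leaving the loop.
    """
    if any(step == 0 for step in steps):
        raise ValueError("a zero step would never advance the scan")
    positions = list(starts)
    while all(0 <= p < len(message) for p in positions):
        values = [message[p] for p in positions]
        if not condition(*values):
            return positions  # only these positions are reported to the server
        positions = [p + s for p, s in zip(positions, steps)]
    return None


if __name__ == "__main__":
    request = b"GET /index\nrest"
    # Loop condition: keep iterating while the current byte is not a newline.
    print(scan_for_exit(request, lambda b: b != 0x0A, [0], [1]))
```

A `None` result corresponds to the case where the delimiter never appears in the client's message, which is itself a fact the client discloses by answering.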


The above technique only works on a simple loop 
characterized by a constant step value. Since such a 
loop frequently appears in network-facing applications, 
our approach contributes to significant reduction of com- 
munication when analyzing these applications. Devel- 
opment of a more general approach for dealing with the 
loops with varying step size is left as our future research. 
Another problem of our technique is that the condition it 
identifies may not be a real loop condition. However, this 
does not bring us much trouble in general, as the penalty 
of such a false positive can be small, including nothing 
but the requirement for the client to scan its message and 
disclosure of a few bytes that seem to meet the exit con- 
dition. If the client refuses to do so, the analysis can 
still continue through directly querying the client about 
branching conditions. 


Improving constraint-solving performance. Solving
a constraint can be time consuming, particularly when
the constraint is nonlinear, involving operations such as
bitwise AND, OR and XOR. To maintain a valid runtime
state for the program under analysis, Panalyst server
needs to run a constraint solver to update hypothetic symbol
values whenever a new branching condition or memory
access is encountered. This will impact the server’s
performance. In our research, we adopted a very simple
strategy to mitigate this impact: we check whether the
current hypothetic values satisfy a new constraint before
solving the constraint. This turns out to be very effective:
in many cases, we found that symbol values good for an
old constraint also work for a new constraint, which
allows us to skip the constraint-solving step.
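The check-before-solve strategy can be illustrated for a single byte symbol. This is a sketch under assumptions: the class name is hypothetical, constraints are Python predicates over one byte, and an exhaustive search over 0..255 stands in for a real constraint solver.

```python
def brute_force_solve(constraints):
    """Stand-in solver: first byte value satisfying the whole conjunction."""
    for candidate in range(256):
        if all(check(candidate) for check in constraints):
            return candidate
    raise ValueError("constraints are unsatisfiable")


class ByteSymbol:
    """One input byte with its hypothetic value and accumulated constraints."""

    def __init__(self, initial=0):
        self.value = initial
        self.constraints = []
        self.solver_calls = 0

    def add_constraint(self, check):
        """Record a new constraint; re-solve only if the current value violates it."""
        self.constraints.append(check)
        if check(self.value):
            return  # current hypothetic value still works, skip the solver
        self.solver_calls += 1
        self.value = brute_force_solve(self.constraints)


if __name__ == "__main__":
    b = ByteSymbol()
    b.add_constraint(lambda v: v != 0x0A)  # "no" to B = '\n'; value 0 still fits
    b.add_constraint(lambda v: v >= 0x61)  # violated by 0, so the solver runs
    print(b.value, b.solver_calls)
```

After any re-solve, the tainted values that depend on the symbol would have to be refreshed with the new hypothetic value; that step is outside this sketch.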


4 Implementation 


We implemented a prototype of Panalyst under Linux, in- 
cluding its server component and client component. The 
details of our implementation are described in this sec- 
tion. 


Message logging. We adopted ptrace to dump the
packet payloads an application receives. Specifically,
ptrace intercepts the system call socketcall()
and parses its parameters to identify the location of an
input buffer. The content of the buffer is dumped to a log
file. We also label the beginning of a connection when
an accept() is observed and the end of the connection
when there is a close(). The data between these two
calls are used to build a message once a runtime exception
happens to the application.


Estimate of information leaks. To evaluate the
information leaks caused by answering a question, our
implementation first generates a constraint that is a conjunction
of all the constraints the client receives that are directly
or transitively related to the question, and then samples
values of the constraint using random values of the
symbols it contains. We set the number of samples to
400, which achieves a confidence interval of ±0.05 and a
confidence level of 0.95. A problem is that the granularity
of this control can be coarse, as 400 samples
can only represent the loss of about one byte of information.
When this happens, our current implementation takes a
conservative approach and assumes that all the bytes in a
constraint are revealed. A finer-grained approach would be
to restore the values of the symbols byte by byte and
repeatedly check information leaks, until all the bytes are
disclosed. An evaluation of such an approach is left as
our future work.
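The relation between the 400 samples, the confidence interval and the one-byte granularity can be checked numerically. The sketch below assumes the usual normal approximation with the worst case $p = 0.5$ and a hypothetical helper for the resolution; the paper does not spell out its exact formula.

```python
import math


def margin_of_error(n, z=1.96, p=0.5):
    """Half-width of the confidence interval for n samples (worst case p = 0.5)."""
    return z * math.sqrt(p * (1.0 - p) / n)


def resolution_bits(n):
    """Largest leak distinguishable with n samples: a single hit gives p = 1/n."""
    return math.log2(n)


if __name__ == "__main__":
    print("margin for 400 samples:", margin_of_error(400))
    print("resolution in bits:", resolution_bits(400))
```

With 400 samples the smallest nonzero estimate is 1/400, so leaks beyond roughly 8.6 bits, about one byte, cannot be told apart by sampling alone.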





Error analyzer. We implemented an error analyzer as
a Pin tool that works under Pin’s Just-In-Time (JIT)
mode [40]. The analyzer performs both taint analysis
and symbolic execution on a vulnerable application, and
builds a new input to reproduce the runtime error that
occurred on the client. The analyzer starts from a message
that contains nothing but zeros and has the same
length as the client’s input, and designates a symbol to
every byte of that message. During the analysis, the
analyzer first checks whether a taint will be propagated
by an instruction and only symbolically evaluates those
whose operands involve tainted bytes. Since many
instructions related to taint propagation use the information
of EFLAGS, the analyzer also takes this register as
a source operand for these instructions. Once an
instruction’s source operand is tainted, symbolic expressions are
computed for the destination operand(s). For example,
consider the instruction add eax, ebx, where ebx is
tainted. Our analyzer first computes a symbolic expression
$B_{ebx} + v_{eax}$, where $B_{ebx}$ is an expression for ebx
and $v_{eax}$ is the value of eax, and then generates another
expression for EFLAGS because the result of the operation
affects the flags OF, SF, ZF, AF, CF and PF.


Whenever a conditional jump is encountered, the 
server queries the client about EFLAGS. To avoid ask- 
ing the client to give away too much information, such 
a query only concerns the specific flag that affects that 
branching, instead of the whole status of EFLAGS. 
For example, consider the following branching: cmp 
eax, ebx and then jz 0x33fd740. In this case, the 
server’s question is only limited to the status of ZF, 
which the branching condition depends on, though the 
comparison instruction also changes other flags such as 
SF and CF. 
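The idea of querying only the flags a branch depends on can be sketched with a small lookup table. The mapping below follows the standard x86 condition codes; the table covers only a handful of mnemonics, and the names `JUMP_FLAGS`, `flags_to_query` and `zero_flag_after_cmp` are hypothetical.

```python
# Status flags read by a few x86 conditional jumps (standard condition codes).
JUMP_FLAGS = {
    "jz": ("ZF",), "je": ("ZF",),
    "jnz": ("ZF",), "jne": ("ZF",),
    "js": ("SF",),
    "jc": ("CF",), "jb": ("CF",),
    "ja": ("CF", "ZF"),
    "jl": ("SF", "OF"),
    "jg": ("ZF", "SF", "OF"), "jnle": ("ZF", "SF", "OF"),
}


def flags_to_query(mnemonic):
    """Flags the server needs to ask about for the given conditional jump."""
    return JUMP_FLAGS[mnemonic.lower()]


def zero_flag_after_cmp(left, right):
    """ZF after `cmp left, right` on 32-bit operands: set when they are equal."""
    return ((left - right) & 0xFFFFFFFF) == 0


if __name__ == "__main__":
    # cmp eax, ebx followed by jz: only ZF is part of the question.
    print(flags_to_query("jz"), zero_flag_after_cmp(0x41, 0x41))
```

Restricting the question in this way means a `jz` after a comparison reveals only whether the operands are equal, not their ordering.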











Constraint solver. Our implementation uses Yices [33] 
to solve constraints so as to find the hypothetic values 
for individual symbols. These values are important to 
keeping the application in a state that is consistent with 
its input. Yices is a powerful constraint solver which 
can handle many nonlinear constraints. However, there 
are situations when a constraint is so complicated that 
its solution cannot be obtained within a reasonable time. 
When this happens, we adopt a strategy that gradually
asks the client for the values of individual symbols
to simplify the constraint, until it becomes solvable
by the constraint solver.


Data compression. We implemented two measures to 
reduce the communication between the client and the 
server. The first one is for processing the questions that 
include the same constraints except input symbols. Our 
implementation indexes each question the server sends 
to the client. Whenever the server is about to ask a ques- 
tion that differs from a previous one only in symbols, it 
only transmits the index of the old question and these 
symbols. This strategy is found to be extremely effec- 
tive when the sizes of the questions become large: in 
our experiment, a question with 8KB was compressed to 
52 bytes. The strategy also complements our technique 
for processing loops: for a complicated loop with varying
steps which the technique cannot handle, the server
needs to query the client iteratively; however, the sizes of
these queries can be very small as they are all about the
same constraint with different symbols. The second measure
is to use a lightweight real-time compression algorithm
to reduce packet sizes. The algorithm we adopted
is minilzo [6], which reduced the bandwidth consump- 
tion in our experiments to less than 100 KB for an anal- 
ysis, at a negligible computational overhead. 
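The two measures can be sketched together as follows. This is an illustration under assumptions rather than the wire format Panalyst uses: the class names and the JSON message layout are hypothetical, and Python's `zlib` stands in for minilzo, which is not part of the standard library.

```python
import json
import zlib


class QuestionEncoder:
    """Server side: send a template once, then refer to it by index."""

    def __init__(self):
        self.index = {}  # template text -> question index

    def encode(self, template, symbols):
        if template in self.index:
            payload = {"ref": self.index[template], "symbols": symbols}
        else:
            self.index[template] = len(self.index)
            payload = {"template": template, "symbols": symbols}
        return zlib.compress(json.dumps(payload).encode("utf-8"))


class QuestionDecoder:
    """Client side: rebuild the question from a template or a stored index."""

    def __init__(self):
        self.templates = []  # position in this list equals the question index

    def decode(self, blob):
        payload = json.loads(zlib.decompress(blob).decode("utf-8"))
        if "template" in payload:
            self.templates.append(payload["template"])
            template = payload["template"]
        else:
            template = self.templates[payload["ref"]]
        return template, payload["symbols"]
```

Because both sides number templates in the order they are first sent, no extra synchronization message is needed in this sketch; a question repeated with different symbols costs only the index and the symbol list.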


5 Evaluation 


In this section, we describe our experimental study of 
Panalyst. The objective of this study is to understand the 
effectiveness of our technique in remote error analysis 
and protection of the user’s privacy, and the overheads 
it introduces. To this end, we evaluated our prototype 
using 6 real applications and report the outcomes of these 
experiments here. 

Our experiments were carried out on two Linux work- 
stations, one as the server and the other as the client. 
Both of them were installed with Redhat Enterprise 4. 
The server has a 2.40GHz Core 2 Duo processor and 
3GB memory. The client has a Pentium 4 1.3GHz pro- 
cessor and 256MB memory. 


5.1 Effectiveness 


We ran Panalyst to analyze the errors that occurred 
in 6 real applications, including Newspost [7], Open- 
VMPS [19], Null-HTTPd (Nullhttpd) [8], Sumus [15], 
Light HTTPd [5] and ATP-HTTPd [3]. The experimental 
results are presented in Table 2. These applications con- 
tain bugs that are subject to stack-based overflow, format 
string error and heap-based overflow. The errors were 
triggered by a single or multiple input packets on the 
client and analyzed on the server. As a result, new pack- 
ets were gradually built from an initial error report and 
interactions with the client to reproduce an error. This 
was achieved without leaking too much user information. 
We elaborate our experiments below. 


Newspost. Newspost is a Usenet binary autoposter for
Unix and Linux. Its version 2.1.1 and earlier has a bug
subject to stack-based overflow: specifically, a buffer in
the socket_getline() function can be overrun by a
long string without a newline character. In our
experiment, the application was crashed by a packet of 2KB.
After this happened, the client sent the server an initial
error report that described the length of the packet and
the type of the error. The report was converted into an
input to an analysis performed on the application, which
included an all-zero string of 2KB. During the analysis,
the server identified a loop that iteratively searched
for ‘0xa’, the newline symbol, as a termination condition
for moving bytes into a buffer, and questioned the
client about the position at which the byte first appeared.
The byte actually did not exist in the client’s packet.
As a result, the input string overflowed the buffer and
spilled onto an illegal address, causing a segmentation
fault. Therefore, the server’s input was shown to be able
to reproduce the error. This analysis was also found to
disclose very little user information: nothing more than
the fact that none of the input bytes was ‘0xa’ was
revealed. This was quantified as 0.9 byte.


OpenVMPS. OpenVMPS is an open-source implementation
of Cisco Virtual Membership Policy Server, which
dynamically assigns ports to virtual networks according
to Ethernet addresses. The application has a format
string bug which allows the input to supply a string with
format specifiers as a parameter for vfprintf(). This
could make vfprintf() write to a memory location.
In the experiment, Panalyst server queried the client to
get “00 00 0c 02” as illustrated in Figure 4. These
four bytes were part of a branching condition, and seem
to be a keyword of the protocol. We also found that the
string “00 b9” was used as a loop counter. These two
bytes were identified by the constraint solver. The string
“62637” turned out to be the content that the format
specifier “%19$hn” wrote to a memory location through
vfprintf(). These bytes were recovered from the client
because they were used as part of a pointer to access memory.
Our implementation successfully built a new input
on the server that reproduced the error, as illustrated
in Figure 4. This analysis recovered 39 bytes from the
client, all of which were related to either branching
conditions or memory access. An additional 18.4 bytes of
information were estimated by the client to be leaked, as
a result of the client’s answers which reduced the ranges
of the values some symbols could take.


Null-HTTPd. Null-HTTPd is a small web server working
on Linux and Windows. Its version 0.5 contains
a heap-overflow bug, which can be triggered when the
HTTP request is a POST with a negative Content-Length
field and a long request content. In our
experiment, the client parsed the request using Wireshark
and delivered nonsensitive information such as the
keyword POST to the server. The server found that the
application added 1024 to the value derived from the
Content-Length field and used the sum as a pointer in the
function calloc. This resulted in a query for the value
of that field, which the client released. At this point, the
server acquired all the information necessary for
reproducing the error and generated a new input illustrated in
Figure 5. The information leaks caused by the analysis
include the keyword, the value of Content-Length,
HTTP delimiters and the knowledge that some bytes are
not special symbols such as delimiters. This was quantified
as 29.7 bytes, about 7% of the HTTP message the
client received.

Table 2: Effectiveness of Panalyst.

| Applications | Vul. Type | New Input Generated? | Size of client’s message (bytes) | Info leaks (bytes) | Rate of info leaks |
|---|---|---|---|---|---|
| Newspost | Stack Overflow | Yes | 2056 | 0.9 | 0.04% |
| OpenVMPS | Format String | Yes | 199 | 57.4 | 28.8% |
| Null-HTTPd | Heap Overflow | Yes | 416 | 29.7 | 7.14% |
| Sumus | Stack Overflow | Yes | 500 | 7.7 | 1.54% |
| Light HTTPd | Stack Overflow | Yes | 211 | 17.9 | 8.48% |
| ATP-HTTPd | Stack Overflow | Yes | 819 | 16.7 | 2.04% |

Figure 3: Input Generation for Newspost. Left: the client’s packet; Right: the new packet generated on the server.


Sumus. Sumus is a server for playing Spanish “mus” 
game on the Internet. It is known that Sumus 0.2.2 and 
the earlier versions have a vulnerable buffer that can be 
overflowed remotely [14]. In our experiment, Panalyst 
server gradually constructed a new input through inter- 
actions with the client until the application was found 
to jump to a tainted address. At this point, the input 
was shown to be able to reproduce the client’s error. 
The information leaked during the analysis is presented 
in Figure 6, including a string “GET” which affected a 
path condition, and 4 “0x90”, which were the address 
the application attempted to access. These 7 bytes were 
counted as leaked information, along with the fact that 
other bytes were not a delimiter. 


Light-HTTPd. Light-HTTPd is a free HTTP server. Its
version 0.1 has a vulnerable buffer on the stack. Our
experiment captured an exception that happened when the
application returned from the function vsprintf()
and constructed the new input. The input shared 14
bytes with the client’s input, which were essential to
determining branching conditions and accessing memory.
For example, the keyword “GET” appeared in a
conditional jump and the letter “H” was used as a condition in
the GLIBC function strstr. The remaining leaks
were caused by intensive string operations, such as
strtok, which frequently used individual bytes for table
lookup and comparison operations. Though these
operations did not give away the real values of these bytes,
they reduced the range of the bytes, which was
quantified as another 3.9 bytes.


ATP-HTTPd. ATP-HTTPd 0.4 and 0.4b involve a remotely
exploitable buffer in the socket_gets() function.
A new input that triggered this bug was built in our
experiment, which is presented in Figure 8. For example,
the string “EDCB” was an address the application
attempted to jump to; this operation actually caused a
segmentation fault. Information leaks during this analysis
are similar to those of Light-HTTPd, and were quantified
as 16.7 bytes.


5.2 Performance 


We also evaluated the performance of Panalyst. The 
client was deliberately run on a computer with 1 GHz 
CPU and 256MB memory to understand the performance 
impact of our technique on a low-end system. The server
ran on a high-end system, with a 2.40GHz Core 2 Duo CPU
and 3GB memory. In our experiments, we measured the 
delay caused by an analysis, memory use and bandwidth 
consumption on both the client and the server. The re- 
sults are presented in Table 3. 

The client’s delay describes the accumulated time that 
the client spent to receive packets from the server, com- 
pute answers, evaluate information leaks and deliver the 
responses. In our experiments, we observed that this
whole process incurred a latency below 3.2 seconds.
Moreover, the memory use on the client side was kept below
5 MB. Given the hardware platform on which this
performance was achieved, we have reason to believe
that such overhead could be afforded even by a device
with limited computing resources, such as a Pocket PC or
PDA. Our analysis introduced a communication overhead
of at most 99,659 bytes. We believe this is still reasonable
for the client, because the size of a typical web
page exceeds 100 KB and many mobile devices nowadays
have the capability of web browsing.

The delay on the server side was measured between 
the reception of an initial error report and the generation 
of a new input. An additional 15 seconds for launching 
our Pin-based analyzer should also be counted. Given 
this, the server’s performance was very good: the maximal
latency was found to be under 1 minute. However,
this was achieved on a very high-end system. Actually,
we observed that the latency was doubled when moving
the server to a computer with 2.36 GHz CPU and 1 GB
memory. More importantly, the server consumed about
100 MB memory during the analysis. This can be easily
afforded by a high-end system such as the one used in our
experiment, but could be a significant burden to a low-end
system such as a mobile device. As an example, most
PDAs have less than 100 MB memory. Therefore, we
believe that Panalyst server should be kept on a dedicated
high-performance system.

Original Packet Content for vmpsd in a crash [hex dump not recoverable]

Figure 5: Input Generation for Null-HTTPd. Left: the client’s packet; Right: the new packet generated on the server.


6 Discussion 


Our research makes the first step towards a fully auto- 
mated and privacy-aware remote error analysis. How- 
ever, the current design of Panalyst is still preliminary, 
leaving much to be desired. For example, the approach 
does not work well in the presence of probabilistic er- 
rors, and our privacy policies can also be better designed. 
We elaborate on the limitations and possible solutions in the
rest of this section, and discuss the future research for
improving our technique in Section 7.

The current design of Panalyst is for analyzing
errors triggered by network input alone. However, runtime
errors can be caused by other inputs such as those from
a local file or another process. Some of these errors can
also be handled by Panalyst. For example, we can record
all the data read by a vulnerable program and organize
them into multiple messages, each of which corresponds
to a particular input to the program; an error analysis
can happen on these messages in a similar fashion as
described in Section 3. A weakness of our technique is that
it can be less effective in dealing with a probabilistic
error such as one caused by multithread interactions.
However, it can still help the server build sanitized
inputs that drive the vulnerable program down the same
execution paths as those followed on the client.

Panalyst may require the client to leak out some infor- 
mation that turns out to be unnecessary for reproducing 
an error, in particular, the values of some tainted pointer 
unrelated to the error. A general solution is describing 
memory addresses as symbolic expressions and taking 
them into consideration during symbolic execution. This 
approach, however, can be very expensive, especially 
when an execution involves a large amount of indirect 
addressing through the tainted pointers. To maintain a 
moderate overhead during an analysis, our current design 
only offers a limited support for symbolic pointers: we 
introduce such a pointer only when it includes a single 
symbol and is used for reading from memory. 

The way we treat loops is still preliminary: it only
works on the loops with constant step sizes and may
falsely classify a branching condition as a loop
condition. As a result, we may miss some real loops, which
increases the communication overhead of an analysis, or
require the client to unnecessarily disclose extra
information. However, the client can always refuse to give more
information and set a threshold for the maximal number
of the questions it will answer. Even if this causes the
analysis to fail, the server can still acquire some
information related to the error and use it to facilitate other
error analysis techniques such as fuzz testing. We plan
to study more general techniques for analyzing loops in
our future research.

Figure 6: Input Generation for Sumus.

Figure 7: Input Generation for Light HTTPd. Left: the client’s packet; Right: the new packet generated on the server.

Entropy-based policies may not be sufficient for reg- 
ulating information leaks. For example, complete dis- 
closure of one byte in a field may have different privacy 
implications from leakage of the same amount of infor- 
mation distributed among several bytes in the field. In 
addition, specification of such policies does not seem to 
be intuitive, which may affect their usability. More effec- 
tive privacy policies can be built upon other definitions of 
privacy such as k-Anonymity [46], l-Diversity [41] and
t-Closeness [38]. These policies will be developed and 
evaluated in our future work. 

The Panalyst client can only approximate the amount of
information disclosed by its answers using statistical
means. It also assumes a uniform distribution over the
values a symbol can take. Designing a better alternative
for quantifying and controlling information leaks is left
to future research.
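
As a rough illustration of how such a statistical approximation could work under the uniform assumption, the following Python sketch estimates how many bits are revealed by answering a yes/no question about a single symbol. The function name, the Monte Carlo sampling and the 256-value domain are assumptions made for illustration; they are not taken from the Panalyst implementation.

```python
import math
import random

def estimate_leak_bits(predicate, domain_size=256, samples=10_000, seed=0):
    """Estimate the bits revealed by confirming that `predicate` holds.

    The symbol is assumed uniform over range(domain_size), so the leak is
    -log2 of the fraction of values that satisfy the predicate.
    """
    rng = random.Random(seed)
    hits = sum(predicate(rng.randrange(domain_size)) for _ in range(samples))
    if hits == 0:
        return math.inf  # no sampled value satisfies the predicate
    return -math.log2(hits / samples)

# Confirming that a byte is below 128 halves the candidate set: about 1 bit.
leak = estimate_leak_bits(lambda b: b < 128)
```

A client following this idea could compare the running total of such estimates against a per-field budget before answering the next question.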

Another limitation of our approach is that it cannot
handle encoded or encrypted input. This problem can
be mitigated by interposing on the API functions (such as
those in the OpenSSL library) for decoding or decryption
to get their plaintext outputs. Our error analysis will be
conducted over the plaintext.


7 Related Work 


Error reporting techniques have been widely used to help
the user diagnose application runtime errors. Windows
error reporting [20], a technique built upon Microsoft’s
Dr. Watson service [18], generates an error report by
summarizing a program state, including the contents of
registers and the stack. It may also ask the user for extra
information such as input documents to investigate an
error. Such an error report is used to search an expert
system for a solution provided by human experts. If the
search fails, the client’s error is recorded for future
analysis. Crash Reporter [16] of Mac OS X and third-party
tools such as BugToaster [27] and Bug Buddy [22] work in a
similar way. As an example, Bug Buddy for GNOME can generate
a stack trace using gdb and let the user post it to the
GNOME bugzilla [4].
Privacy protection in existing error reporting techniques
mostly relies on the privacy policies of those who collect
reports. This requires the user to trust the collector, and
also forces her to either send the whole report or submit
nothing at all. In contrast, Panalyst reduces the user’s
reliance on the collectors to protect her privacy and also
allows her to submit only the part of the information she
is comfortable with. Even if such information is
insufficient for reproducing an error, it can make it
easier for other techniques to identify the underlying bug.
Moreover, the Panalyst server can automatically analyze the
error caused by an unknown bug, whereas existing techniques
depend on humans to analyze new bugs.

Figure 8: Input Generation for ATP HTTPd. Left: the client’s packet; Right: the new packet generated on the server.


Table 3: Performance of Panalyst.

Programs      client     client memory  server     server memory  total size of      total size of
              delay (s)  use (MB)       delay (s)  use (MB)       questions (bytes)  answers (bytes)

Newspost      0.022      47             12.14      99.3           527                184
OpenVMPS      1.638      3.9            17         122.3          45,610             6,088
Null-HTTPd    1.517      5.0            13.09      118.1          99,659             3,416
Sumus         0.123      4.8            1.10       85.4           5,968              2,760
Light HTTPd   0.88       48             6.59       110.1          14,005             2,808
ATP-HTTPd     3.197      5.0            37.11      145.4          50,615             15,960

Proposals have been made to improve privacy protection
during error reporting. Scrash [25] instruments an
application’s source code to record information related to
a crash and generate a “clean” report that does not contain
sensitive information. However, it needs source code and
therefore does not work on commodity applications without
the manufacturer’s support. In addition, the technique
introduces performance overheads even when the application
works properly and, like other error reporting techniques,
uses a remote expert system and therefore does not perform
automatic analysis of new errors. Brickell et al. propose a
privacy-preserving diagnostic scheme, which works on binary
executables [24, 36]. The technique aims at searching a
knowledge base framed as a decision tree in a
privacy-preserving manner. It also needs to profile an
application’s execution. Panalyst differs from these
approaches in that it does not interfere with an
application’s normal run except for logging inputs, which
is very lightweight, and is devised for automatically
analyzing an unknown bug.

Techniques for automatic analysis of software
vulnerabilities have been intensively studied. Examples
include the approach for generating vulnerability-based
signatures [26], Vigilante [30], DACODA [31] and EXE [53].
These approaches assume that an input triggering an error
is already given and therefore privacy is no longer a
concern. Panalyst addresses the important issue of how to
obtain such an input without infringing too much on the
user’s privacy. This is achieved while the Panalyst server
is analyzing the vulnerable program. Our technique combines
dynamic taint analysis with symbolic execution, which bears
some similarity to a recent proposal for exploring multiple
execution paths [42]. However, that technique is primarily
designed for identifying hidden actions of malware, while
Panalyst is for analyzing runtime errors. Therefore, we
need to consider issues that are not addressed by the prior
approach. A prominent example is the set of techniques we
propose to handle tainted pointers, which is essential to
reliably reproducing an error.

Similar to Panalyst, a technique has been proposed recently
to symbolically analyze a vulnerable executable and
generate an error report through solving constraints [29].
The technique also applies entropy to quantify information
loss caused by the error reporting. Panalyst differs from
that approach fundamentally in that our technique generates
a new input remotely while the prior approach directly
works on the causal input on the client. Performing an
intensive analysis on the client is exactly what we want to
avoid, because it increases the client’s burden and thus
discourages the user from participating. Although an
evaluation of the technique reports a moderate
overhead [29], it does not include computation-intensive
operations such as instruction-level tracing, which can, in
some cases, introduce hundreds of seconds of delay and
hundreds of megabytes of execution traces [23]. This is
barely acceptable even to users who have such resources,
and hardly affordable to those using weak devices such as
Pocket PCs and PDAs. In fact, reproducing an error without
direct access to the causal input is much more difficult
than analyzing the input locally, because it requires
careful coordination between the client and the server to
ensure a gradual release of the input information that
neither endangers the user’s privacy nor fails the
analysis. In addition, Panalyst can enforce privacy
policies on individual protocol fields and therefore
achieves finer-grained control of information than the
prior approach.


8 Conclusion and Future Work 


Remote error analysis is essential to timely discovery of
security-critical vulnerabilities in applications and
generation of fixes. Such an analysis works most
effectively when it protects users’ privacy, incurs the
least performance overhead on the client and provides the
server with sufficient information for an effective study
of the underlying bugs. To this end, we propose Panalyst, a
new technique for privacy-aware remote error analysis.
Whenever a runtime error occurs, the Panalyst client sends
the server an initial error report that includes nothing
but the public information about the error. Using an input
built from the report, the Panalyst server analyzes the
propagation of tainted data in the vulnerable application
and symbolically evaluates its execution. During the
analysis, the server queries the client whenever it does
not have sufficient information to determine the execution
path. The client responds to a question only when the
answer does not leak too much user information. The answer
from the client allows the server to adjust the content of
the input through symbolic execution and constraint
solving. As a result, a new input is built that includes
the information necessary for reproducing the error that
occurred on the client. Our experimental study of this
technique demonstrates that it exposes a very small amount
of user information, introduces negligible overhead to the
client and enables the server to effectively analyze an
error.

The current design of Panalyst analyzes only errors
triggered by network inputs. Future research will extend
our approach to handle other types of errors. In addition,
we plan to improve the techniques for estimating
information leaks and to reduce the number of queries the
client needs to answer.
Abstract


We analyze several recent schemes for watermarking network
flows based on splitting the flow into intervals. We show
that this approach creates time-dependent correlations that
enable an attack that combines multiple watermarked flows.
Such an attack can easily be mounted in nearly all
applications of network flow watermarking, both in
anonymous communication and stepping stone detection. The
attack can be used to detect the presence of a watermark,
recover the secret parameters, and remove the watermark
from a flow. The attack can be effective even if the
watermarks in different flows carry different messages.

We analyze the efficacy of our attack using a proba- 
bilistic model and a Markov-modulated Poisson process 
(MMPP) model of interactive traffic. We also implement 
our attack and test it using both synthetic and real-world 
traces, showing that our attack is effective with as few 
as 10 watermarked flows. Finally, we propose a counter- 
measure that defeats the attack by using multiple water- 
mark positions. 


1 Introduction 


Traffic analysis is the practice of inferring sensitive in- 
formation from communication patterns. Traffic analy- 
sis has been particularly studied in the context of anony- 
mous communication systems, where features such as 
packet timings, sizes, and counts can be used to link two 
flows and break anonymity guarantees [2, 22]. Traffic 
analysis is also sometimes used in intrusion detection, 
for example, to detect the presence of stepping stones 
within an enterprise [29]. 

Recently, there has been growing interest in the use of
watermarking to aid traffic analysis [27, 24, 21, 25, 28].
In this case, traffic patterns of one flow (usually packet
timings) are actively modified to contain a special
pattern. If the same pattern is later found on another
flow, the two are considered linked. Watermarking
significantly reduces the computation and communication
costs of traffic analysis, and may also lead to more
precise detection with fewer false positives.¹ Watermarking
has been applied both to the problem of attacking anonymity
systems [24, 25, 28] and to that of detecting stepping
stones [27, 21].

In both contexts, many flows must be watermarked 
before linked flows are discovered. In our work, we 
consider whether an attacker can learn enough infor- 
mation to defeat the watermark by observing multiple 
watermarked flows. (We use “attacker” here to refer 
to someone attacking the watermarking scheme; in the 
case where watermarks themselves are used by attack- 
ers, these will be the “counter-attackers.”) We apply 
this multi-flow threat model to the latest generation of 
interval-based watermarks [21, 25, 28]. These water- 
marks subdivide the flow to be marked into discrete time 
intervals and perform transformative operations on an 
entire interval of packets. This approach is more ro- 
bust to packet losses, insertions, and repacketization than 
previous approaches that focused on individual pack- 
ets [27, 24], because the time intervals allow the water- 
marker and detector to retain synchronization. However, 
the same synchronization property can be used by at- 
tackers by “lining up” multiple watermarked flows and 
observing the transformations that were inserted. 

We show through experiments that the interval-based 
watermark schemes are completely vulnerable to an at- 
tacker who can collect a small number of watermarked 
flows—about 10. This is sufficient to not only detect that 
a watermark is indeed present, but also to recover the se- 
cret parameters of the watermark scheme and to be able 
to remove the watermark at a low cost. Furthermore, our 
attack works even if different watermarked flows contain 
different embedded “messages,” with only about twice 
the number of watermarked flows necessary. 

We also consider some countermeasures to such attacks. We
show that by using multiple “keys” (time interval
assignments) to watermark different flows, it is possible
to defeat our attack. This countermeasure comes at a cost
of higher computation overhead at the detector and a higher
rate of false positives. However, this increased cost is
only linear, whereas the increased cost for the attacker is
superexponential, thus providing an effective defense.

The rest of the paper is organized as follows. The next 
section presents the setting for our attack and reviews 
the three schemes considered in this paper. Section 3 
describes the theoretical foundation for our attack, and 
Section 4 implements the attack. We discuss potential 
countermeasures to the attack in Section 5. Section 6 
concludes. 


2 Background 


We first describe the setting of our attack in a bit more 
detail and then review the essential details of the water- 
marking schemes we analyze. 


2.1 Network Flow Watermarking 


The setting for network flow watermarking is similar to 
that of other digital media watermarks (and network flow 
watermarks use similar techniques). The general model, 
as shown in Figure 1, involves a network flow passing 
through a watermarking point (typically a router of some 
sort) that transforms, or distorts, the flow in some way 
(typically by modifying packet timings by selectively de- 
laying some packets). In the general setting, the water- 
marker has a secret key and uses it to encode a message 
in the traffic characteristics. 

After watermarking, the flow undergoes some natu- 
ral or intentional distortion. Natural distortion can take 
the form of delays at intermediate routers (or rather, 
variability of delays, i.e., jitter), but may also include 
dropped or retransmitted packets, repacketization, and 
other changes. In addition, an attacker may intention- 
ally distort traffic characteristics in order to prevent the 
watermark from being recovered. 

The distorted flow finally arrives at a detection point. 
The detector shares the secret key and uses it to extract 
the message encoded in the watermark. A good water- 
mark will allow reliable recovery of the message from 
the watermarked flow despite the intermediate distortion. 

In network flow watermarks, the message component of the
watermark may be used in two ways. First, all watermarked
flows may be marked with a single message. In this case,
the detector’s main goal is to decide whether the watermark
is present or not by checking whether the decoded message
is the correct one. Alternately, different flows may have a
different message embedded, so that when a watermarked flow
is detected, it can be linked with a particular marked
flow. This comes at a cost of less reliable detection,
since the single-message context creates more opportunities
to detect errors. Our attacks are designed to work in both
single-message and multiple-message contexts.

Figure 2: An anonymous system.


2.2 Watermarks in Anonymous Systems


At a very high level, an anonymous system maps a num- 
ber of input flows to a number of output flows while hid- 
ing the relationship between them, as shown in Figure 2. 
The internal operation can be implemented by a mix net- 
work [8], onion routing [23], or a simple proxy [6]. The 
goal of an attacker, then, is to link an incoming flow to 
an outgoing flow (or vice versa). 

A watermark can be used to defeat anonymity pro- 
tection by marking certain input flows and watching for 
marks on the output flows. For example, a malicious 
website might insert a watermark on all flows from the 
site to the anonymizing system. A cooperating attacker 
who can eavesdrop on the link between a user and the 
anonymous system can then determine if the user is 
browsing the site or not. Similarly, a compromised en- 
try router in Tor [11] can watermark all of its flows, and 
cooperating exit routers or websites can detect this wa- 
termark. 

Note that this does not enable a fundamentally new attack
on low-latency anonymous systems: it has long been
known [23] that an attacker who can observe a flow at two
points can determine if the flow is the same, unless cover
traffic is used. (In fact, deployed low-latency systems
such as Onion Routing [23], Freedom [1], and Tor [11] have
all opted to forgo cover traffic because of its expense,
hoping instead that it will be difficult for an attacker to
observe a significant fraction of incoming and outgoing
flows.) However, watermarking makes the attack much more
efficient. With passive traffic analysis, if one attacker
observes n input flows and another observes m output flows,
the attack will require O(n) communication between the
attackers and O(nm) computation, as one attacker must
transmit characteristics of all n flows to the other, and
then each output flow must be matched against each input
flow. With watermarking, on the other hand, no
communication needs to take place between the two attackers
after they have established a shared secret key, and the
computation cost is O(n) and O(m) at the watermarker and
detector respectively, as the watermarker marks each input
flow and the detector checks each output flow for the
presence of a mark.


Multi-Flow Attack In the above examples, a website 
or an input router will insert the watermark into all the in- 
put flows going through them. Therefore, it will be pos- 
sible for the anonymous system to obtain multiple water- 
marked flows. These flows can then be used to recover 
the secret key and then remove the watermarks from sub- 
sequent flows, using the techniques we describe below. 
Our techniques are low-cost, requiring a small number 
of watermarked flows and modest computation, so it is 
easy to check whether watermarking is being applied by 
a given website or router by aggregating its flows. 

The only context where our attack does not apply is a
traffic confirmation attack. In this case, an attacker
already has a strong suspicion that a particular input flow
corresponds to a particular output flow, and therefore need
only watermark a single flow. Traffic confirmation attacks
are a rarer use of traffic analysis, since they only
confirm existing suspicions, rather than revealing new
linkages between flows. Furthermore, the efficiency gains
of watermarks are not beneficial in this case, since
n = m = 1. Therefore, our attack will apply to the vast
majority of practical uses of watermarks in anonymous
systems.


2.3 Watermarks in Stepping Stones 


A stepping stone is a host that is used to relay traffic
through an enterprise network to another remote
destination, in order to hide the true origin of the flow.
To detect such hosts, an enterprise must be able to link an
incoming flow to the relayed outgoing flow. The situation
is therefore very similar to an anonymous communication
system, with n flows entering the enterprise and m flows
leaving. Once again, this task may be accomplished by
passive traffic analysis [26, 29, 5, 12], but watermarks
make such detection much more efficient. Passive techniques
will require O(nm) computation and potentially O(n)
communication, if there are multiple border routers through
which traffic can enter or leave the enterprise. With
watermarking, border routers for an enterprise will insert
watermarks on all incoming flows, and check for the
presence of the mark on all outgoing flows, as shown in
Figure 3, reducing the computation cost to O(n) and O(m)
for the incoming and outgoing flows.

Figure 3: Stepping stone detection architecture.


Multi-Flow Attack Since all incoming flows must be 
marked, an attacker in control of a compromised host can 
simply generate multiple external flows destined for that 
host (and not relay them), and then collect the timing 
characteristics of the flows as they arrive at the host to 
recover the secret watermark key. Once this is accom- 
plished, the key can be used to remove watermarks from 
relayed flows, thus defeating stepping stone detection. 


2.4 Interval Centroid-Based Watermarking (ICBW)


We next review the scheme proposed by Wang et al. [25];
for more details of the scheme as well as some analysis
we refer the reader to [25]. The scheme is based on
dividing the stream into intervals of equal length, using
two parameters: $o$, the offset of the first interval, and
$T$, the length of each interval. A subset of $2n = 2rl$ of
these intervals is chosen at random and then randomly
divided into two further subsets $A$ and $B$, each
consisting of $n = rl$ intervals. Each of the sets $A$ and
$B$ is randomly divided into $l$ subsets, denoted by
$\{A_i\}_{i=1}^{l}$ and $\{B_i\}_{i=1}^{l}$, each consisting
of $r$ intervals. The $i$-th watermark bit is encoded using
the sets $\{A_i, B_i\}$. Therefore, a watermark of length
$l$ can be embedded in the flow. Figure 4 depicts the
random selection and grouping of time intervals within a
flow for watermark insertion.

Figure 4: Random selection and assignment of time intervals within a packet flow for watermark insertion.

Figure 5: Distribution of packet arrival times in an interval of size $T$ before and after being delayed.
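
The selection and grouping step described above can be sketched in Python as follows. This is a minimal illustration, assuming the flow is long enough to contain `num_intervals` intervals and that a seeded pseudo-random generator stands in for the shared RNG; the function and variable names are hypothetical and are not taken from [25].

```python
import random

def select_icbw_intervals(num_intervals, r, l, seed):
    """Pick 2*r*l distinct interval indices and group them into A_i and B_i."""
    n = r * l
    if 2 * n > num_intervals:
        raise ValueError("flow has too few intervals for the requested watermark")
    rng = random.Random(seed)  # the shared seed drives every random choice
    chosen = rng.sample(range(num_intervals), 2 * n)  # random order, no repeats
    set_a, set_b = chosen[:n], chosen[n:]
    # Split A and B into l groups of r intervals; group i carries watermark bit i.
    groups_a = [sorted(set_a[i * r:(i + 1) * r]) for i in range(l)]
    groups_b = [sorted(set_b[i * r:(i + 1) * r]) for i in range(l)]
    return groups_a, groups_b

groups_a, groups_b = select_icbw_intervals(num_intervals=200, r=4, l=8, seed=1234)
```

Because every choice depends only on the seed, a watermarker and a detector that share it derive the same groups without further communication.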


The watermarker and detector agree on the parameters $o$
and $T$ and use a random number generator (RNG) and a seed
$s$ to randomly select and assign intervals for watermark
insertion. To keep the watermark transparent, all of these
parameters are kept secret. Depending on whether the $i$-th
watermark bit is 1 or 0, the watermarker delays the arrival
times of the packets at the interval positions in sets
$A_i$ or $B_i$ respectively, by a maximum of $a$. Figure 5
illustrates the effect of this delaying strategy on the
distribution of packet arrival times in an interval of size
$T$ (this operation is called “squeezing” by Wang et al.).
Finally, the overall watermark embedding is illustrated in
Figures 6 (a) and (b).

As a result of this embedding scheme, the expected value of
the aggregate centroid, i.e., the average offset of the
packet arrival time from the beginning of the current
length-$T$ interval, in either the intervals $A_i$ (when the
watermark bit is 1) or $B_i$ (when the watermark bit is 0)
corresponding to bit $i$ is increased by $\frac{a}{2}$. The
difference between the aggregate centroids of $A_i$ and
$B_i$ will now be $\frac{a}{2}$ when the watermark bit is 1
or $-\frac{a}{2}$ when the watermark bit is 0.
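
The shift of $a/2$ can be checked with a short calculation. The sketch below assumes that, before marking, a packet's offset $x$ within its interval is uniform on $[0, T)$, and that squeezing applies the linear map $x \mapsto x' = a + \frac{T-a}{T}x$, so that the largest delay is $a$ (for a packet at the very start of the interval). This specific map is an assumption that is consistent with the description above; [25] gives the exact strategy.

```latex
\begin{align*}
E[x]  &= \frac{T}{2}, \\
E[x'] &= E\!\left[a + \frac{T-a}{T}\,x\right]
       = a + \frac{T-a}{T}\cdot\frac{T}{2}
       = \frac{T+a}{2}, \\
E[x'] - E[x] &= \frac{a}{2}.
\end{align*}
```

For example, with $a = 350$ ms and $T = 500$ ms, the expected centroid of a squeezed interval moves from 250 ms to 425 ms.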


The detector checks for the existence of the watermark
bits. The check on watermark bit $i$ is performed by
testing whether the average difference of the aggregate
centroids of packet arrival times in the intervals $A_i$
and $B_i$ is closer to $\frac{a}{2}$ or $-\frac{a}{2}$. If
it is closer to $\frac{a}{2}$, then the watermark bit is
decoded as 1, and if it is closer to $-\frac{a}{2}$, the
bit is declared a 0. By focusing on the arrival times of
many intervals ($r$ of them for each bit of the watermark)
rather than individual packet timings, the ICBW approach is
robust to repacketization, insertion of chaff, and mixing
of data flows. Network jitter can shift packets from one
interval into another, but the suggested parameters for $a$
and $T$ (350 ms and 500 ms respectively) are large enough
that few packets will be affected.

Figure 6: ICBW bit insertion. (a) Insertion of watermark bit 0. (b) Insertion of watermark bit 1.
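
The embedding and detection of a single ICBW bit can be sketched as follows. The code works on packet offsets pooled over the $r$ intervals of each group, and it assumes a linear squeezing map with maximum delay $a$; that map, the function names and the synthetic uniform traffic are illustrative assumptions rather than the implementation of [25].

```python
import random

def squeeze(offset, T, a):
    """Map an offset in [0, T) into [a, T); the delay is at most a (assumed linear map)."""
    return a + (T - a) * offset / T

def centroid(offsets):
    """Aggregate centroid: mean offset of packets from the start of their interval."""
    return sum(offsets) / len(offsets)

def embed_bit(offsets_a, offsets_b, bit, T, a):
    """Squeeze the packets of group A_i for bit 1, or of group B_i for bit 0."""
    if bit == 1:
        return [squeeze(x, T, a) for x in offsets_a], list(offsets_b)
    return list(offsets_a), [squeeze(x, T, a) for x in offsets_b]

def decode_bit(offsets_a, offsets_b, a):
    """Decode 1 if the centroid difference is closer to a/2 than to -a/2."""
    diff = centroid(offsets_a) - centroid(offsets_b)
    return 1 if abs(diff - a / 2) < abs(diff + a / 2) else 0

rng = random.Random(7)
T, a = 0.5, 0.35  # seconds: the 500 ms and 350 ms values quoted in the text
group_a = [rng.uniform(0, T) for _ in range(2000)]  # synthetic offsets in A_i
group_b = [rng.uniform(0, T) for _ in range(2000)]  # synthetic offsets in B_i
decoded = [decode_bit(*embed_bit(group_a, group_b, bit, T, a), a) for bit in (0, 1)]
```

Averaging over many packets is what makes the decision insensitive to individual packet timings, which is the robustness property claimed for the scheme.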

The secrecy of the interval positions $A_i$ and $B_i$ makes
the mark difficult to detect or remove, as it is hard to
distinguish the patterns generated by the mark from natural
variation in traffic rates. We show in Sections 3 and 4,
however, that a simple technique allows an observer to
effectively recover the watermark positions and values.
This technique is applicable to any watermarking scheme
that creates periods of clear or low traffic at specific
parts of the flows across many flows. Next, we briefly
describe Interval-Based Watermarking (IBW), a flow
watermarking scheme proposed by Pyun et al. [21] to detect
stepping stones. Our attack also applies to this scheme.


2.5 Interval-Based Watermarking 


Similar to ICBW, the watermarking scheme of Pyun et al. [21]
manipulates the arrival times of the packets over a set of
preselected intervals. The watermark embedding is achieved
by manipulating the rates of traffic in successive
intervals. There are two manipulations: an interval $I_i$
may be cleared by delaying all packets from interval $I_i$
until interval $I_{i+1}$, or it may be loaded by delaying
all packets from interval $I_{i-1}$ until interval $I_i$.
A loaded interval will therefore have twice the expected
number of packets, and a cleared one will have none. To
send a 0 bit in position $i$, the interval $I_i$ is cleared
and $I_{i+1}$ is loaded; to send a 1, $I_i$ is loaded and
$I_{i+1}$ is cleared. (Note that since clearing one interval
implicitly loads the next, it takes 3 intervals to send a
bit.)

The watermarker and detector agree on the parameters $o$,
$T$ and a list of positions $S = \{s_1, \ldots, s_n\}$; all
of these parameters are secret. The watermarker encodes the
watermark bits at the interval positions $s_i$ and the
detector checks for the existence of the watermark. The
check is performed by testing whether the data rate in
interval $I_{s_i}$ differs from the rate in interval
$I_{s_i+1}$ by a factor exceeding a threshold; if it does,
then a 0 or 1 bit is considered detected. By focusing on
data rates rather than individual packet timings, the
interval-based approach is robust to repacketization of
data flows.
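
The clear and load operations and the rate comparison can be sketched as follows. The flow is modeled simply as a list of per-interval packet counts; the function names, the way the threshold is applied and the sample counts are assumptions for illustration, and the embedding follows the textual description above literally rather than the implementation of [21].

```python
def clear(counts, i):
    """Clear interval i by delaying all of its packets into interval i+1."""
    out = list(counts)
    out[i + 1] += out[i]
    out[i] = 0
    return out

def load(counts, i):
    """Load interval i by delaying all packets of interval i-1 into it."""
    return clear(counts, i - 1)

def embed_bit(counts, i, bit):
    """Bit 0: clear I_i (which loads I_{i+1}). Bit 1: load I_i, then clear I_{i+1}."""
    if bit == 0:
        return clear(counts, i)
    return clear(load(counts, i), i + 1)

def detect_bit(counts, i, threshold=2.0):
    """Return 0 or 1 if the two rates differ by more than `threshold` times, else None."""
    cur, nxt = counts[i], counts[i + 1]
    if nxt > threshold * cur:
        return 0
    if cur > threshold * nxt:
        return 1
    return None

flow = [5, 4, 6, 5, 5, 4]  # packets per interval (made-up numbers)
marked0 = embed_bit(flow, 2, 0)
marked1 = embed_bit(flow, 2, 1)
```

Packets are only ever delayed, never dropped, so the total packet count of the flow is unchanged by the embedding.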

The detection process may generate false positives due 
to natural variation in packet rates, or false negatives, as 
delays between the watermarker and repacketization at 
the relay cause rates in intervals to shift. To ensure reli- 
able transmission, each watermark bit is encoded in sev- 
eral positions in the stream. Pyun et al. show that this 
technique operates with very low false positive and false 
negative rates. 


2.6 Spread-Spectrum Watermarking 


In the DSSS watermarking technique due to Yu et al. [28],
a binary watermark is embedded in the flow to achieve
invisible traceback. In their proposed approach, each bit
of a length-$n$ binary watermark is embedded in an interval
of length $T_s$. Hence the whole watermark is inserted in
an interval of length $nT_s$. To embed a watermark bit 1,
the rate of the packets in the designated interval of
length $T_s$ is manipulated according to a Pseudo-Noise
(PN) code. The PN code is a quickly varying signal that
switches between $+1$ and $-1$, and the duration of each
$\pm 1$ period is $T_c$. In particular, Yu et al. [28]
choose a length-7 PN code for their implementation. When
the PN code is $+1$, the rate of the flow remains intact,
but when the PN code is $-1$, the rate of the flow is
decreased for a duration of $T_c$.² On the other hand, to
embed a watermark bit 0, the flow is manipulated using the
complement of the PN code. Figure 7 depicts the embedding
of watermark 110 for a PN code of length 5.

Figure 7: A length-5 PN code and insertion of DSSS watermark 110.
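
The spreading step can be sketched as follows. The PN code shown, the base rate and the size of the rate reduction are hypothetical values chosen for illustration; [28] specifies the actual code and modulation used.

```python
def chip_sequence(bits, pn_code):
    """Spread each watermark bit: the PN code for a 1, its complement for a 0."""
    chips = []
    for bit in bits:
        sign = 1 if bit == 1 else -1
        chips.extend(sign * c for c in pn_code)
    return chips

def target_rates(bits, pn_code, base_rate, reduction):
    """Per-chip sending rate: unchanged on +1 chips, lowered on -1 chips."""
    low = base_rate * (1.0 - reduction)
    return [base_rate if c == 1 else low for c in chip_sequence(bits, pn_code)]

pn = [1, -1, 1, 1, -1]  # hypothetical length-5 PN code
chips = chip_sequence([1, 1, 0], pn)
rates = target_rates([1, 1, 0], pn, base_rate=100.0, reduction=0.5)
```

Every flow marked with the same code and bits has its rate lowered in the same chip periods, which is the property the multi-flow attack relies on.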

The watermarker and detector agree on the parameter $T_s$
and a Pseudo-Noise code. The detector recovers the
watermark by first applying a high-pass filter to the
received signal and subsequently passing it through
despreading and a low-pass filter. The details of the
detector’s structure are inconsequential to our attack and
the interested reader is referred to [28].

Given that the watermark insertion technique in DSSS 
reduces the flow rates over certain intervals across all 
flows, it is vulnerable to our multi-flow attack. 





3 Attack Analysis 


In this section, we present a probabilistic analysis of our
attack using a model for interactive traffic. Though some
watermarked traffic may consist of non-interactive bulk
transfer traffic, we will show in Section 4.1 that
interactive traffic presents a more difficult case for our
attack, and thus we analyze it here. As DSSS watermarks
work well only against non-interactive traffic, our
analysis here applies only to IBW and ICBW, but as we
demonstrate experimentally, our attack will work on DSSS
watermarks as well.


3.1 Model of Interactive Traffic 


We first present a model for interactive traffic, as it is
essential to our analysis. Let $f_m$ denote the $m$-th flow
in a pool of interactive traffic flows. Given that the
traffic might be encrypted, we do not consider the content
of the packets; likewise, the sizes of packets representing
keystrokes are likely to be uniform. We thus consider only
the arrival time of the packets in the flow, allowing us to
model the flow as a point process.

Suppose we observe packet arrivals at times
$t_1 < t_2 < \cdots < t_n$ in a fixed interval $(0, T]$
such that $t_i$ is the time the $i$-th packet arrived. The
collection of arrival times
$\mathbf{t}_{f_m} = (t_1, t_2, \ldots, t_n)$ specifies a
flow $f_m$. Furthermore, we model the interactive
connection as a Markov-modulated Poisson process
(MMPP) [14, 15]. The set of possible states is $\{0, 1\}$,
where state 0 corresponds to the user typing characters and
state 1 corresponds to periods of silence. Figure 8 depicts
this two-state MMPP.

Let $X(t)$ denote the state of the process at time $t$.
When the process is in state 0, packet arrivals are modeled
as a renewal process; i.e., the interarrival times are
independent and identically distributed (i.i.d.). In the
case of an interactive traffic flow, this renewal process
is often modeled as Poisson [12, 5]. The Poisson assumption
means that the interarrival times of the packets, denoted
by $\ell$, are exponentially distributed. Hence their
probability density function (PDF) is given by:


$$f_\ell(t) = \lambda_0 e^{-\lambda_0 t}$$


where Xo denotes the rate of the Poisson process. When 
the process is in state 1, the arrivals are again modeled 
as Poisson but with rate Ay < Apo. Given that state 1 
corresponds to a period of silence (no packet arrivals), 
as soon as a packet arrives, the embedded Markov chain 
transitions to state 0. Therefore, the transition probabil- 
ities {P;;,i,7 = 0,1} of the embedded Markov chain 
{Xn,n > 0} are as follows: 


$$P_{00} + P_{01} = 1, \qquad P_{10} = 1, \qquad P_{11} = 0 \tag{1}$$


and the embedded Markov chain is defined by the matrix:


$$\begin{pmatrix} P_{00} & 1 \\ 1 - P_{00} & 0 \end{pmatrix}$$


The steady state probabilities $\pi_0, \pi_1$ of the embedded
chain $X_n$ are given by:


$$\begin{pmatrix} P_{00} & 1 \\ 1 - P_{00} & 0 \end{pmatrix} \begin{pmatrix} \pi_0 \\ \pi_1 \end{pmatrix} = \begin{pmatrix} \pi_0 \\ \pi_1 \end{pmatrix}$$


Figure 8: The embedded two-state Markov chain.

The steady state probabilities $P_0, P_1$ of the Markov
process $X(t)$ are given by [15]:


$$P_0 = \frac{\lambda_1}{\lambda_1 + (1 - P_{00})\lambda_0}, \qquad P_1 = \frac{(1 - P_{00})\lambda_0}{\lambda_1 + (1 - P_{00})\lambda_0} \tag{2}$$
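
As an illustrative check of (2), the sketch below (not part of the original analysis; the function names are hypothetical) first solves for the stationary distribution of the embedded chain and then weights each state by the mean time spent per visit, $1/\lambda_0$ or $1/\lambda_1$. Under these assumptions the result should coincide with the closed form in (2).

```python
def embedded_stationary(p00):
    """Stationary distribution (pi0, pi1) of the embedded two-state chain.

    Solves pi0 = p00 * pi0 + pi1 and pi1 = (1 - p00) * pi0
    together with pi0 + pi1 = 1.
    """
    pi0 = 1.0 / (2.0 - p00)
    pi1 = (1.0 - p00) / (2.0 - p00)
    return pi0, pi1


def time_stationary(p00, lam0, lam1):
    """Time-stationary probabilities (P0, P1) of X(t).

    Each embedded-chain probability is weighted by the mean waiting time
    until the next packet: 1/lam0 in state 0 and 1/lam1 in state 1.
    """
    pi0, pi1 = embedded_stationary(p00)
    w0, w1 = pi0 / lam0, pi1 / lam1
    total = w0 + w1
    return w0 / total, w1 / total


if __name__ == "__main__":
    # Illustrative values only (assumed for this example).
    print(time_stationary(0.9, 4.0, 0.5))
```

The weighting step is the standard conversion from an embedded (per-arrival) chain to a continuous-time process; it is shown here only to make the origin of (2) easier to follow.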


The significance of the steady state probabilities of (2)
is that they capture the probability of each of the states
0 and 1 at any given point in time. Recall that ICBW
encodes the watermark bits “1” or “0” by delaying the
arrival times of the packets in the set of intervals $A_i$ or
$B_i$, respectively, and IBW encodes the watermark bits “0”
or “1” by transferring the traffic of an interval of length
$T$ to some adjacent interval. Therefore, they both create
periods of time with no arrivals in the flow. This period
is of length $a$ for ICBW and of length $T$ for IBW.
When the embedded Markov chain is in state $i$, we can
compute the probability of zero arrivals occurring in a
period of length $\ell$ starting at any given point as:


$$P_{f_i}(0; \ell) = e^{-\lambda_i \ell} \tag{3}$$


since the waiting times are exponentially distributed and
therefore memoryless.

In general, given a flow $f_m$ generated from an MMPP,
from (3), the probability $P_{f_m}(0; \ell)$ of having a period
of length $\ell$ with no arrivals is:


$$P_{f_m}(0; \ell) = P_0 P_{f_0}(0; \ell) + P_1 P_{f_1}(0; \ell) = P_0 e^{-\lambda_0 \ell} + P_1 e^{-\lambda_1 \ell} \tag{4}$$


where the steady state probabilities $\{P_0, P_1\}$ are given
by (2).

A good watermarking scheme requires that the watermarked
stream not reveal any clues of the presence of the watermark
to an unauthorized observer. Therefore, it is desirable to
pick $\ell$ such that $P_{f_m}(0; \ell)$ above is reasonably
large, so that the presence of silent periods does not give
away the watermark. We next present the parameters of our
two-state MMPP and show that, for those parameters, the
watermark indeed cannot be detected by observing a single
stream watermarked with ICBW or IBW. However, we will show
that if attackers have access to multiple copies of a marked
signal, they can defeat the two watermarking schemes both
when multiple flows are watermarked with the same message
and when different messages are embedded in different
flows.


3.2 Parameter Selection and Goodness of Fit

We estimated the parameters $P_{00}$, $\lambda_0$, and
$\lambda_1$ of our MMPP model by using network traces of SSH
connections taken at a wireless access point in our
institution. For a trace, we first estimated the underlying
state of the embedded Markov chain by choice of a threshold
$\tau$. If the interarrival time between two packets exceeded
the threshold $\tau$, we assumed that the process was in
state 1, and if the interarrival time between two packets was
less than the threshold $\tau$, we assumed that the user was
typing and therefore the process was in state 0. Once the
states $\{X_n,\ n \geq 0\}$ of the underlying chain were
determined, by concatenating the parts of the interactive
traffic that came from the same underlying state, we could
extract two Poisson flows with rates $\lambda_0$ and
$\lambda_1$ from the original flow. Given that the expected
number of arrivals of a Poisson process with rate $\lambda$
in the time interval $(0, t]$ is $\lambda t$, we estimated the
rates $\lambda_0$ and $\lambda_1$ by calculating the arrival
rates of each of the two extracted flows. Parameter $P_{00}$
was estimated as the portion of the time the chain spent at
state 0. Our estimated values for the transition probability
$P_{00}$ and the rates $\lambda_0$ and $\lambda_1$ were
as follows:


$$P_{00} = 0.96, \qquad \lambda_0 = 5.6, \qquad \lambda_1 = 0.57 \tag{5}$$
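
The threshold-based estimation described above can be sketched as follows. This is a minimal illustration rather than the code used for the measurements: the function name and the input format (a sorted list of arrival times in seconds) are assumptions, and the estimate of the transition probability follows one possible reading of "the portion of the time the chain spent at state 0", namely the fraction of interarrival gaps labeled as state 0.

```python
def estimate_mmpp(arrival_times, threshold):
    """Estimate (p00, lam0, lam1) from sorted packet arrival times.

    A gap longer than `threshold` is attributed to state 1 (silence);
    any other gap is attributed to state 0 (typing).
    """
    gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
    typing = [g for g in gaps if g <= threshold]
    silent = [g for g in gaps if g > threshold]
    if not typing or not silent:
        raise ValueError("need gaps on both sides of the threshold")
    # Concatenating the gaps of one state gives a flow whose rate is
    # (number of arrivals) / (total duration).
    lam0 = len(typing) / sum(typing)
    lam1 = len(silent) / sum(silent)
    # Assumption: fraction of embedded-chain steps labeled state 0.
    p00 = len(typing) / len(gaps)
    return p00, lam0, lam1


if __name__ == "__main__":
    trace = [0.0, 0.1, 0.2, 0.3, 2.3, 2.4]  # made-up example trace
    print(estimate_mmpp(trace, threshold=1.0))
```

The quality of such an estimate depends on the choice of threshold, which the sketch leaves to the caller.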


To assess the goodness of fit of our MMPP model with the
parameters of (5), we used a quantile-quantile (q-q) plot [7].
Using the theoretical CDF of the model, the observations are
mapped into values in the interval $[0, 1]$. If the underlying
statistical model is consistent with the observations, the
values obtained from the mapping are uniformly distributed in
the interval $[0, 1]$. To assess the uniformity of the mapped
values, or equivalently the goodness of the theoretical model,
an empirical CDF of the mapped values is compared against the
theoretical CDF of a uniform distribution, which is a
45-degree reference line. The closer the empirical CDF is to
this reference line, the greater the evidence that the
statistical model captures the underlying phenomenon. The q-q
plot in Figure 9 shows that the MMPP model for the interactive
traffic with parameters (5) provides a good fit for the data
and significantly outperforms a simpler Poisson model, or a
Pareto distribution that has previously been proposed to fit
interactive traffic [19].

Figure 9: Q-Q plot of Poisson and MMPP models with our
sample data.
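
The uniformity check behind such a plot can be sketched numerically. In the illustration below, the exponential CDF stands in for the model CDF purely as an assumption (the CDF of the fitted MMPP is not reproduced here), and the helper names are hypothetical. The maximum distance from the 45-degree line is the usual Kolmogorov-Smirnov distance to the uniform distribution.

```python
import math
import random


def pit_values(samples, cdf):
    """Map each observation through the model CDF; returns sorted values in [0, 1]."""
    return sorted(cdf(x) for x in samples)


def max_deviation_from_uniform(u_sorted):
    """Largest vertical gap between the empirical CDF of the mapped
    values and the 45-degree reference line."""
    n = len(u_sorted)
    return max(
        max((i + 1) / n - u, u - i / n) for i, u in enumerate(u_sorted)
    )


def exponential_cdf(rate):
    """CDF of exponentially distributed interarrival times (stand-in model)."""
    return lambda x: 1.0 - math.exp(-rate * x)


if __name__ == "__main__":
    rng = random.Random(0)
    data = [rng.expovariate(2.0) for _ in range(2000)]  # synthetic sample
    for rate in (2.0, 10.0):
        dev = max_deviation_from_uniform(pit_values(data, exponential_cdf(rate)))
        print(f"model rate {rate}: max deviation {dev:.3f}")
```

A small deviation indicates that the candidate model is consistent with the data; a mismatched model pulls the empirical CDF away from the reference line.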


3.3 Multi-Flow Attack


Regardless of whether the ICBW or IBW watermark- 
ing schemes are implemented using the same message 
across all interactive flows or they use multiple messages 
for different flows, they are subject to an averaging at- 
tack. This is because both schemes embed watermarks 
by emptying the same parts across various flows. Next, 
we will explain our attack for both the single-message 
and multiple-message watermarks. 


3.3.1 Single-Message Watermarks 


When ICBW or IBW watermarking schemes are implemented using
the same message across all interactive flows, an attacker
who has access to $k$ watermarked flows can form an aggregate
of all the flows, taking the sorted union of all the arrival
times of packets in all flows. We denote this aggregated
stream by $\bar{f}_k$, where the subscript $k$ denotes the
number of streams involved in forming the aggregate flow.

Given that each interactive stream is independent of
all the other streams, the probability of having a period
of length $\ell$ with no arrivals in the flow $\bar{f}_k$ is
given by:


$$P\{N_{\bar{f}_k}(t_a + \ell) - N_{\bar{f}_k}(t_a) = 0\} = \prod_{m=1}^{k} P_{f_m}(0; \ell) = P_{f_m}(0; \ell)^k \tag{6}$$

Equation (6) shows that the probability of having a period of
length $\ell$ with no arrivals decreases exponentially in $k$,
the number of copies used to form the aggregate flow
$\bar{f}_k$. Therefore, if the streams are not watermarked,
there is a very small probability that the aggregate stream
has periods of no arrivals. However, if both ICBW and IBW use
the same key and message across all interactive flows, the
aggregated copy of the watermarked flows always exhibits
patterns of no arrivals of length $\ell$ that give away the
location of the watermark as well as the maximum delay
parameter $a$ of ICBW and the period $T$ of IBW.
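
The aggregation step can be illustrated with a short sketch. It assumes each flow is given as a sorted list of arrival times in seconds; the function names are hypothetical and not taken from any implementation of the attack.

```python
import heapq


def aggregate(flows):
    """Sorted union of the arrival times of all flows.

    Each flow must already be sorted, which lets heapq.merge combine
    them without re-sorting.
    """
    return list(heapq.merge(*flows))


def clear_intervals(times, min_length):
    """Return (begin, end) pairs of packet-free periods lasting at
    least `min_length` seconds between consecutive arrivals."""
    return [(a, b) for a, b in zip(times, times[1:]) if b - a >= min_length]


if __name__ == "__main__":
    flows = [[0.1, 0.2, 1.0], [0.15, 0.9, 1.1]]  # made-up arrival times
    merged = aggregate(flows)
    print(clear_intervals(merged, min_length=0.35))
```

With unwatermarked flows, long clear intervals in the aggregate should become rare as more flows are merged, which is what (6) quantifies.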


Substituting the parameters of (5) into (4), assuming
$\ell = 350$ms, as suggested by Wang et al. [25], we have
$P_{f_m}(0; 0.35) = 0.33$. Therefore, in an aggregate of as
few as 10 flows, the probability of a period of 350ms without
any arrivals is as low as
$P_{f_m}(0; 0.35)^{10} = 1.6 \times 10^{-5}$.
Similarly, for $\ell = 900$ms, as used by Pyun et al. [21],
we have $P_{f_m}(0; 0.9) = 0.17$ and
$P_{f_m}(0; 0.9)^{10} = 2.4 \times 10^{-8}$.
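
These figures can be reproduced with a few lines of code. The sketch below evaluates (4) with the steady state probabilities of (2) and the parameters of (5); the function name is hypothetical. Because it works with unrounded intermediate values, the last digits of the ten-flow results may differ slightly from figures obtained by first rounding the single-flow probability.

```python
import math


def no_arrival_probability(length, p00, lam0, lam1):
    """Equation (4): probability of no arrivals in `length` seconds."""
    # Steady state probabilities of X(t), equation (2).
    p0 = lam1 / (lam1 + (1.0 - p00) * lam0)
    p1 = 1.0 - p0
    return p0 * math.exp(-lam0 * length) + p1 * math.exp(-lam1 * length)


if __name__ == "__main__":
    for length in (0.35, 0.9):
        p = no_arrival_probability(length, p00=0.96, lam0=5.6, lam1=0.57)
        # Equation (6) with k = 10 independent flows.
        print(f"l = {length} s: one flow {p:.3f}, ten flows {p ** 10:.2e}")
```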


3.3.2 Multi-Message Watermarks 


If different flows are used to encode different messages,
simple aggregation will no longer work, since by switching
between 1 and 0 bits, both ICBW and IBW apply different
transforms to different intervals. For example, with ICBW, a
given interval may be squeezed when a certain bit is 0, and
not squeezed when that bit is 1. By aggregating flows where
that bit changes, no empty periods will be detected.


However, by observing a few more flows, we can still
detect the presence of a watermark. Given a bit $b$ and a
set of $2k - 1$ flows, by the pigeonhole principle, there
exists a subset of $k$ flows where the bit has the same value.
If we aggregate all the flows in that subset, we will find
clear intervals of length $a$ or $T$, depending on the scheme
that is used.


To detect the watermark, then, we examine all
$\binom{2k-1}{k}$ subsets of $k$ flows out of a collection of
$2k - 1$. For each bit position, we will be able to find at
least one subset where that bit value is all the same, and we
can thus detect it with the same ease as when a single-valued
watermark is used. The number of subsets is, of course,
exponential in $k$, but our attack works with values of $k$
around 10, making such a search feasible, as
$\binom{19}{10} = 92378$.
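
A brute-force version of this subset search might look as follows. It is a sketch under stated assumptions: flows are sorted lists of arrival times, a subset is reported as soon as its aggregate contains any sufficiently long packet-free period, and the function name is hypothetical.

```python
import heapq
import math
from itertools import combinations


def subsets_with_clear_interval(flows, k, min_length):
    """Yield (subset_indices, clear_intervals) for every k-subset of
    flows whose aggregate has a packet-free period of at least
    `min_length` seconds."""
    for subset in combinations(range(len(flows)), k):
        merged = list(heapq.merge(*(flows[i] for i in subset)))
        gaps = [(a, b) for a, b in zip(merged, merged[1:])
                if b - a >= min_length]
        if gaps:
            yield subset, gaps


if __name__ == "__main__":
    # Number of subsets examined for k = 10 out of 2k - 1 = 19 flows.
    print(math.comb(19, 10))
    flows = [
        [0.0, 0.1, 1.0, 1.1],   # made-up flows; 0 and 1 share a gap
        [0.05, 0.15, 1.05],
        [0.1, 0.5, 0.6, 1.0],
    ]
    print(list(subsets_with_clear_interval(flows, k=2, min_length=0.8)))
```

Each subset costs one merge of $k$ flows, so the total work is the number of subsets times the size of the aggregate.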


Examining all these subsets increases the possibility
of a false positive, that is, a naturally occurring cleared
interval in the aggregate flow. However, such false positives
will be relatively rare, so the attacker can estimate the
value of $\ell$ and then discard intervals that do not match.


3.4 Impact of Timing Perturbations


Our analysis so far has assumed that the attacker sees the
timings of the watermarked stream directly. In reality,
these timings will be perturbed by network delays. As a
result, the intervals cleared by the watermark may have
some packets from previous intervals shifted into them
and no longer appear completely empty. Note that what
is relevant here is not the magnitude of the network delay
but its variance, or jitter, since delaying all packets
by an equal amount does not affect our attack. And if
the jitter is much less than $\ell$, our attack will work
equally well: if jitter is $< \epsilon$ with high probability,
then we will find clear intervals of length at least
$\ell - \epsilon$ in the $k$ aggregated watermarked streams,
whereas the probability of seeing such an interval in
unwatermarked streams is
$P_{f_m}(0; \ell - \epsilon)^k \approx P_{f_m}(0; \ell)^k$,
which is vanishingly small. We observe that the studied
parameters of the ICBW and IBW schemes have $\ell = 350$ms or
900ms, in order to resist traffic perturbations,
repacketization, etc. The network jitter, on the other hand,
is two orders of magnitude smaller. Our experiments on
PlanetLab [3] show it to be on the order of several
milliseconds for geographically distributed hosts, and this
matches the results of previous studies [18]. Therefore, it is
indeed the case that the jitter is $< \epsilon \ll \ell$, so
it will not significantly affect our attack.
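
To get a feel for how little a small jitter bound changes the false-positive probability, the following sketch compares the probability of a clear interval of the shortened length with that of the nominal length, using (4) and the parameters of (5). The 5 ms jitter bound is an assumed value chosen to represent "several milliseconds", and the function names are hypothetical.

```python
import math


def no_arrival_probability(length, p00, lam0, lam1):
    """Equation (4): probability of no arrivals in `length` seconds."""
    p0 = lam1 / (lam1 + (1.0 - p00) * lam0)
    return (p0 * math.exp(-lam0 * length)
            + (1.0 - p0) * math.exp(-lam1 * length))


def jitter_inflation(length, eps, k, p00=0.96, lam0=5.6, lam1=0.57):
    """Ratio P(0; length - eps)^k / P(0; length)^k for k aggregated flows."""
    shortened = no_arrival_probability(length - eps, p00, lam0, lam1)
    nominal = no_arrival_probability(length, p00, lam0, lam1)
    return (shortened / nominal) ** k


if __name__ == "__main__":
    for length in (0.35, 0.9):
        ratio = jitter_inflation(length, eps=0.005, k=10)
        print(f"l = {length} s: inflation factor {ratio:.3f}")
```

A factor close to 1 means that tolerating the jitter leaves the false-positive probability at essentially the same order of magnitude.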


4 Implementation 


Having shown the theoretical background behind our at- 
tack, we now show the result of implementing it in prac- 
tice. We developed algorithms to detect the presence of a 
watermark, recover the secret parameters, and to remove 
the watermark from new streams. We evaluated the al- 
gorithms using both real flows gathered from traces and 
synthetic flows generated using our MMPP model, presented
in Section 3.1. We first present our attacks for
single-message watermarks, and then extend them to
watermarks that use multiple messages.
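
A synthetic flow of the kind mentioned above could be generated roughly as follows. This is a sketch of one reading of the model in Section 3.1, not the generator actually used: the chain is assumed to start in state 0, the state is updated after every packet, and the function name is hypothetical.

```python
import random


def simulate_mmpp(duration, p00, lam0, lam1, rng=None):
    """Return sorted packet arrival times in (0, duration].

    In state 0 (typing) the next packet arrives after an Exp(lam0)
    wait, in state 1 (silence) after an Exp(lam1) wait.
    """
    rng = rng or random.Random()
    t, state, arrivals = 0.0, 0, []
    while True:
        t += rng.expovariate(lam0 if state == 0 else lam1)
        if t > duration:
            return arrivals
        arrivals.append(t)
        if state == 1:
            state = 0  # a packet always ends the silent period
        else:
            state = 0 if rng.random() < p00 else 1


if __name__ == "__main__":
    flow = simulate_mmpp(60.0, p00=0.96, lam0=5.6, lam1=0.57,
                         rng=random.Random(1))
    print(len(flow), "packets in 60 s")
```

Passing an explicitly seeded `random.Random` makes a generated flow reproducible, which is convenient when comparing attack variants on the same input.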


4.1 Watermark Detection 


As above, our attack relies on collecting a series of flows 
that are watermarked with the same message. These 
flows are combined into a single flow and examined for 
large gaps between packets. Figure 10(a) shows the 
packet arrivals for 10 combined flows before and after 
an ICBW watermark has been applied. The watermark 
pattern is clearly visible in the combined flows, reveal- 
ing the presence of a watermark. Figure 10(b) shows the 
same process working with the IBW watermark scheme. 

Figure 10: 10 flows before and after watermarking.

We also performed the same analysis for non-interactive,
bulk transfer traffic by applying the watermark to packet
traces we collected from web downloads across a DSL
connection. Figure 11(a) shows the packet
timings for 10 combined flows before and after a water- 
mark. Bulk transfers have a somewhat more regular be- 
havior, since they are controlled by the TCP algorithms, 
rather than by individual users. This can be seen at the 
beginning of the 10 combined flows before watermark: 
the TCP slow start period results in a much lower rate 
for the first few seconds of the connection. However, 
this regularity quickly gets out of sync due to irregular 
network delay and response times. In the graph of 10 
watermarked flows, the intervals squeezed by the water- 
mark are readily visible. In fact, because data transfer 
flows are much more dense than interactive flows, the 
watermark is visible even on a single flow (Figure 11(b)). 


The DSSS watermark is intended to be applied to bulk 
transfer traffic such as FTP, since it interferes with traf- 
fic rate, rather than changing packet timings. A similar 
multi-flow attack works against DSSS as well, as shown 
in Figure 12. (We used the parameters of chip length 
0.4s, chip sequence length of 7, and code length of 7.) In 
this case, periods of high interference are clearly seen as 
low-rate periods in the flows, allowing one to recover the 
chip sequence and then decode the watermark.

Figure 11: Watermark detection on bulk traffic.


4.2 Watermark Removal 


Based on the combined graphs, it is easy to recover the
watermark parameters as well. We can build a template
of clear intervals by selecting all intervals larger than a
threshold; for example, Figure 13(a) shows the template
derived from 10 flows watermarked by ICBW. The estimated
template is somewhat imprecise, due to network jitter, as
well as the fact that small (10-20ms) gaps may precede or
follow the clear intervals even when 10 flows are combined.
However, this imprecision is not a problem since the
watermark can still be effectively removed. The template also
lets us estimate the values of $T$ and $a$. We can average
the lengths of clear intervals and the distance between two
consecutive clear intervals to obtain a relatively precise
estimate. Armed with this information, we can then modify a
new flow to remove the watermark.
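
The template construction and parameter estimation can be sketched as below. The function names are hypothetical. One detail is an assumption added for this illustration: since two consecutive clear intervals in the template may be several periods apart, each spacing is divided by its nearest integer multiple of the smallest observed spacing before averaging.

```python
from statistics import mean


def build_template(aggregate_times, threshold):
    """Clear intervals (begin, end) longer than `threshold` seconds."""
    return [(a, b) for a, b in zip(aggregate_times, aggregate_times[1:])
            if b - a > threshold]


def estimate_parameters(template):
    """Return (a_hat, t_hat): mean clear-interval length and the
    estimated spacing between interval starts."""
    if len(template) < 2:
        raise ValueError("need at least two clear intervals")
    a_hat = mean(end - begin for begin, end in template)
    spacings = [nxt[0] - cur[0] for cur, nxt in zip(template, template[1:])]
    base = min(spacings)
    # Assumption: every spacing is close to an integer multiple of `base`.
    multiples = [max(1, round(s / base)) for s in spacings]
    t_hat = sum(spacings) / sum(multiples)
    return a_hat, t_hat


if __name__ == "__main__":
    template = [(0.0, 0.35), (0.5, 0.85), (1.5, 1.86)]  # made-up template
    print(estimate_parameters(template))
```

If the smallest observed spacing already spans several periods, this heuristic overestimates $T$; more flows or a longer observation window would be needed in that case.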


Figure 12: Average rate of 10 flows after DSSS watermark.

For ICBW, we have two choices: we can either shift
traffic into the clear intervals in the template, thereby
negating the squeezing action of the watermark, or find
intervals that have not been squeezed and squeeze them.
We decided to implement the former approach since it
does not require as precise an estimate of $T$. Also, it
leaves the flow looking more natural. Our shift is
implemented as shown in Figure 13(b), by shifting all packets
in a period $\alpha$ before the clear interval into an
interval of length $\beta$ inside the clear interval. Larger
values of $\alpha$ and smaller values of $\beta$ will more
significantly shift the interval centroid back in a different
direction; however, very small values of $\beta$ may not have
the desired effect, since the template is imprecise and too
many packets may get shifted without arriving into the
correct interval. Experimentally, we found that
$\alpha = 0.9(\hat{T} - \hat{a})$ and
$\beta = 0.8(\hat{T} - \hat{a})$ provide the best results,
where $\hat{T}$ and $\hat{a}$ are estimated values of $T$
and $a$.
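
The shifting operation can be illustrated as follows. The text fixes only the source period and the target length; the linear mapping used here, the symbol names, and the function names are assumptions made for this sketch. Every affected packet is only delayed, never advanced, as required of an attacker who cannot send packets before they arrive.

```python
def shift_parameters(t_hat, a_hat):
    """Source period and target length from the estimated T and a."""
    return 0.9 * (t_hat - a_hat), 0.8 * (t_hat - a_hat)


def shift_into_clear_interval(times, clear_start, alpha, beta):
    """Delay packets in [clear_start - alpha, clear_start) into
    [clear_start, clear_start + beta).

    Assumption: positions are mapped linearly, so the relative order
    of the shifted packets is preserved.
    """
    lo = clear_start - alpha
    shifted = []
    for t in times:
        if lo <= t < clear_start:
            t = clear_start + (t - lo) * beta / alpha
        shifted.append(t)
    return sorted(shifted)


if __name__ == "__main__":
    alpha, beta = shift_parameters(t_hat=0.5, a_hat=0.35)
    flow = [0.0, 0.40, 0.45, 0.9]  # made-up arrival times
    print(shift_into_clear_interval(flow, clear_start=0.5,
                                    alpha=alpha, beta=beta))
```

The sketch handles a single clear interval; applying it to a whole template would repeat the call for each interval start.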

Table 1 shows the results of watermark removal.
We reimplemented the ICBW detection mechanism and
computed the Hamming distance of the encoded watermark
to the detected one, collected over 100 flows. (We
show the average distance, with the range shown in
parentheses.) With as few as 10 flows, we are able to get a
reasonably good estimate of $T$ and $a$ and remove the
watermark in most cases; the ICBW detection scheme uses
a Hamming distance threshold of 5-8 to decide when a
watermark has been detected. With 15 flows, we get a
more accurate template and estimate, and all 100 flows
will clear the template.

A similar approach can be used to attack the IBW wa- 
termark; by delaying packets so that they fall into the 
clear intervals, the clear intervals become indistinguish- 
able from loaded ones. Table 2 shows the effect of apply- 
ing our attack on the IBW watermark, where 24 bits are 
encoded at different levels of redundancy. Even with a 
redundancy of 80, most bits are not recovered correctly. 
These results were obtained by using the code provided
by the authors of [21].

Figure 13: Watermark Removal


We expect a similar technique should work against 
DSSS watermarks; a template of low rates can be in- 
ferred from several flows. An attacker can then de- 
crease rates in the non-interference section of the tem- 
plate by dropping packets, or increase the rate in the 
high-interference section by delaying packets into the 
template. We do not have experimental results for DSSS 
since the detection algorithm is fairly complex and we 
did not have access to an implementation of it. 


4.3 Multiple Messages 


So far we have assumed that the watermarks on all of
the aggregated flows are the same. Here, we consider the
case where different flows carry different messages. As
described in Section 3.3.2, we can still execute our attack
by relying on the fact that within a collection of $2k - 1$
flows, for any given bit $b$, we can find $k$ flows where this
bit has the same value.

Figure 14(a) plots the result of such a subset search.
By inspection, we can see that in the first subset of
flows, the interval (4.5, 4.85) has been cleared. In the
second subset, this interval remains cleared and the
interval (0, 0.35) becomes clear as well. The third subset
has no packets in (2.0, 2.35) and the fourth in (3.5, 3.85).
Note that this pattern immediately lets us detect the
presence of a watermark; Figure 14(b) shows the same flow
subsets on an unwatermarked section.


Table 1: Results for removing ICBW watermarks

Num     a            T            Hamming           Hamming       Hamming    Ave.    Max
flows                             not watermarked   watermarked   attacked   delay   delay
10      365          492          17.9              2.67          13.9       33.6    164
        (σ = 10.7)   (σ = 15.2)   (13-24)           (1-7)         (2-20)
15      353          504          17.6              2.74          16.1       42.6    188.2
        (σ = 0.60)   (σ = 1.62)   (13-25)           (0-6)         (12-21)
20      346          504          17.2              2.68          16.4       45.4    194.3
        (σ = 0.30)   (σ = 0.50)   (12-21)           (0-5)         (11-20)


Table 2: Watermark bits detected before and after applying
the attack (watermark length is 24).

Rep.    Bits detected                    Marked
        Before attack    After attack    packets
1       7                3               53
5       14               5               156
10      24               4               505
15      24               2               754
20      24               2               967
25      24               2               1209
30      24               2               1440
35      24               2               1724
40      24               2               2008
45      24               2               2307
50      24               2               2697
55      24               2               3083
60      24               2               3296
65      24               2               3623
70      24               2               3876
75      24               2               4090
80      24               2               4343


Recovery of the secret parameters can proceed largely 
as in the single-message case. One difficulty is that with 
the flow subsets, we may encounter large intervals that 
are not precisely aligned with the interval positions. For 
example, Table 3(a) lists the blank intervals longer than 
0.2s in the last subset. There are a lot of wrong-size in- 
tervals that result from the case when 8 or 9 of the flows 
in the subset have had an interval squeezed, but the last 
one or two add a few packets to the mix. To address this 
concern, we can select the largest empty intervals in any 
subset, as shown in Table 3(b). These will correspond to 
intervals that have been squeezed on every flow. This can
be used to recover the watermark parameters $T$ and $a$.


Figure 14: Subset approach to multiple message watermarks.
(b) Un-watermarked flow subsets.


Once these are obtained, the next step is to scan
through all subsets and determine which intervals are
always squeezed at the same time and call such lists $S_i$;
these will correspond to either $A_b$ or $B_b$ for some bit
$b$. Then, for each $S_i$, we find $S_j$ such that $S_i$ and
$S_j$ are never squeezed at the same time. This will tell us
that $S_i$ and $S_j$ correspond to the same bit. Armed with
this knowledge, we can remove the watermark by observing
the watermarked stream for a short while, and when we
see intervals from $S_i$ that are being squeezed, we
proceed to artificially squeeze intervals in $S_j$ (or
unsqueeze further intervals in $S_i$, or both).
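
The grouping and pairing steps can be sketched as follows. The input format is an assumption: each observation (for example, one flow subset) is represented by the set of interval indices found squeezed in it. The function names are hypothetical, and with few observations two unrelated lists may appear mutually exclusive by chance, so the output should be read as candidate pairs only.

```python
from collections import defaultdict


def group_intervals(observations):
    """Group interval indices that are squeezed in exactly the same
    observations; returns {signature: sorted interval list}."""
    signature = defaultdict(set)
    for obs_index, squeezed in enumerate(observations):
        for interval in squeezed:
            signature[interval].add(obs_index)
    groups = defaultdict(list)
    for interval, seen in signature.items():
        groups[frozenset(seen)].append(interval)
    return {sig: sorted(members) for sig, members in groups.items()}


def pair_groups(groups):
    """Pair up groups that are never squeezed in the same observation."""
    sigs = sorted(groups, key=lambda s: groups[s])
    return [(groups[a], groups[b])
            for i, a in enumerate(sigs)
            for b in sigs[i + 1:]
            if not (a & b)]


if __name__ == "__main__":
    # Made-up observations over eight interval positions.
    observations = [{0, 1, 4, 5}, {2, 3, 4, 5}, {0, 1, 6, 7}, {2, 3, 6, 7}]
    print(pair_groups(group_intervals(observations)))
```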

Note that the subset technique can also be applied
when not all the flows are watermarked. For example, a
website may watermark only some connections that are
of particular interest; by finding subsets that are all
watermarked, the mark can still be recovered. A scheme
that probabilistically marked some flows and used
different messages at the same time would present a
challenge to our attack; however, we suggest that a different
countermeasure be used, since it allows all flows to be
marked, which is desirable for most applications.


Table 3: Blank intervals from subset of flows

(a) All blank intervals        (b) Largest blank intervals
Start (s)    End (s)           Start (s)    End (s)
2.08         2.32              130.98       131.35
3.50         3.85              140.49       140.86
4.03         4.25              151.99       152.36
5.13         5.33              161.99       162.35
11.59        11.85             235.99       236.37
18.14        18.37             306.49       306.86
19.56        19.79             334.49       334.86
25.58        25.82             368.49       368.86
30.06        30.34             43.99        44.36
34.08        34.35             51.98        52.35


5 Countermeasures 


We next consider several countermeasures to our attack. 


5.1 Multiple Offsets 


The watermarking schemes we analyze have the ability 
to self-synchronize by trying different values for the off- 
set o and using the best match. Thus, o can be changed 
for different streams. The synchronization mechanism 
can introduce more errors into the detection, but the use 
of increased encoding redundancy can make up for it. 

The use of different offsets makes our attack more
difficult, since simply aggregating $k$ flows will result in
misalignment, destroying the clear intervals. It is, of
course, possible to test different positions for $o$ for each
stream, but to test $n$ positions in $k$ flows requires
$n^{k-1}$ trials (we can hold the first flow fixed).

On the other hand, some alignments of two or three 
flows can be discarded immediately, if such an alignment 
results in few intervals that are clear of packets. Further- 
more, the search for o can be imprecise at first: even if 
each flow is aligned to within 0.1s of the correct position, 
intervals of 150ms or 700ms will be seen in the average. 
Thus, changing offsets makes our attack more difficult,
but not impossible to perform.


5.2 Multiple Positions


Another alternative is to choose different positions, in the
case of ICBW and IBW, and different PN codes in the
case of DSSS. Let us consider the case of ICBW. A
watermarker and detector must use the same assignment of
intervals to the sets $A_i$ and $B_i$, as determined by the
random seed $s$, in order for the watermark to be successfully
recovered. However, a watermarker may decide to use
multiple seed values, $s_1, \ldots, s_n$, and pick one of them
at random for each flow.

To deal with this, the detector would need to try to
recover the watermark with each possible $s_i$ and pick the
best match. Once again, the probability of error grows
with $n$, but increased redundancy can again be used to
make up for it. Note that the probability of error falls
exponentially with increased redundancy, but grows only
roughly linearly with $n$.

We can once again use the subset attack to try to find $k$
flows that use the same seed value $s_i$; however, the
complexity grows quickly out of control. The probability of
a given set of $k$ flows using the same seed is
$\left(\frac{1}{n}\right)^{k-1}$, which falls quite quickly
even when $k = 10$. By the pigeonhole principle, within
$n(k - 1) + 1$ flows we can always find a subset of $k$ flows
with the same seed, but the search space of all
$\binom{n(k-1)+1}{k}$ subsets grows very quickly with $n$.
For example, with $n = 6$ and $k = 10$,
$\binom{55}{10} > 10^{10}$, resulting in an infeasible number
of subsets to enumerate.
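
Both quantities are easy to evaluate directly. The sketch below is only an illustration of the counting argument; the function names are hypothetical.

```python
import math


def same_seed_probability(n, k):
    """Chance that k flows, each using one of n seeds chosen uniformly
    at random, all share the seed of the first flow."""
    return (1.0 / n) ** (k - 1)


def subsets_to_search(n, k):
    """Number of k-subsets among n(k - 1) + 1 flows, the smallest
    collection guaranteed to contain k flows with a common seed."""
    return math.comb(n * (k - 1) + 1, k)


if __name__ == "__main__":
    n, k = 6, 10
    print(f"same-seed probability: {same_seed_probability(n, k):.2e}")
    print(f"subsets to enumerate:  {subsets_to_search(n, k):.2e}")
    # For comparison, the single-seed case (n = 1) needs only one subset.
    print(subsets_to_search(1, k))
```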

The same principle can apply to IBW, by picking multiple
sets of positions $\{s_i\}$, and to DSSS by using multiple
PN codes.


6 Conclusion 


We have demonstrated an attack on the interval centroid- 
based watermarking scheme and interval based water- 
marking scheme that is highly successful, while requir- 
ing a low amount of resources. Our attack is based on a 
solid theoretical grounding, and has been validated with 
a prototype implementation tested against the original 
ICBW and IBW prototypes. We can remove the water- 
mark from an existing flow for both schemes. Addition- 
ally, in case of IBW we can also recover the watermark 
parameters and values, allowing us to modify the water- 
mark or insert it into other streams, thereby confusing 
the detector. We have also suggested a countermeasure 
to our attack—switching bit positions. This countermea- 
sure can impose a very high computation cost and there- 
fore disable the attack. 

While the use of network flow watermarking techniques
for various security applications is quite new [27,
21, 25, 28], digital watermarking, and specifically
multimedia watermarking, is a nearly mature field. Indeed,
most network flow watermarking schemes are inspired
by multimedia watermarks. To name a few, Wang and
Reeves’s [27] scheme is a special instance of QIM
watermarking, a well-understood multimedia watermarking
technique [16]. The IBW scheme of Pyun et al. [21] is
based on the patchwork watermark of Bender et al. [4],
and the scheme of Yu et al. [28] is based on spread
spectrum watermarking [9].

The current approach for designing network flow
watermarks suffers from the fact that, while watermarking
schemes are inspired by the digital watermarking
schemes, little attention is given to the entirety of the
watermarking design problem. For example, statistical
characteristics of the underlying media are always an 
important consideration in digital watermarks, but net- 
work watermark research does not adequately model the 
effect that network traffic characteristics have on water- 
marks; as we showed, the density of bulk traffic makes 
it very difficult to insert a transparent watermark. Like- 
wise, digital watermarks have long considered the possi- 
bility that multiple watermarked documents can be used 
to attack watermarks [10, 17], but we are unaware of pre- 
vious work looking at the multi-flow threat model for 
watermarking. We thus hope that future work on wa- 
termarks will be informed by our work and perform a 
broader analysis.


Notes


1. We are unaware of a quantitative comparison of the accuracy of
watermarking techniques with passive traffic analysis, but reported
false-positive rates for most watermarking techniques are quite low. In
any case, the two techniques can be combined to improve accuracy.

2. Yu et al. suggest that this can be done by sending an interfering
flow across a bottleneck link; their scheme is thus unique in not
requiring full control of packet forwarding for the flow.


Abstract


In this paper, we present an approach for verifying that 
trusted programs correctly enforce system security goals 
when deployed. A trusted program is trusted to only 
perform safe operations despite having the authority to
perform unsafe operations; for example, initialization 
programs, administrative programs, root network dae- 
mons, etc. Currently, these programs are trusted without 
concrete justification. The emergence of tools for build- 
ing programs that guarantee policy enforcement, such as 
security-typed languages (STLs), and mandatory access 
control systems, such as user-level policy servers, finally 
offers a basis for justifying trust in such programs: we 
can determine whether these programs can be deployed 
in compliance with the reference monitor concept. Since 
program and system policies are defined independently, 
often using different access control models, compliance 
for all program deployments may be difficult to achieve 
in practice, however. We observe that the integrity of 
trusted programs must dominate the integrity of system 
data, and use this insight, which we call the PIDSI ap- 
proach, to infer the relationship between program and 
system policies, enabling automated compliance verifi- 
cation. We find that the PIDSI approach is consistent 
with the SELinux reference policy for its trusted pro- 
grams. As a result, trusted program policies can be de- 
signed independently of their target systems, yet still be 
deployed in a manner that ensures enforcement of system 
security goals. 


1 Introduction 


Every system contains a variety of trusted programs. A 
trusted program is a program that is expected to safely 
enforce the system’s security goals despite being autho- 
rized to perform unsafe operations (i.e., operations that 
can potentially violate those security goals). For example,
the X Window server [37] is a trusted program because it
enables multiple user processes to share access to
the system display, and the system trusts it to prevent
one user’s data from being leaked to another user. A
system has many such trusted processes, including those for
initialization (e.g., init scripts), administration (e.g.,
software installation and maintenance), system services
(e.g., windowing systems), authentication services (e.g.,
remote login), etc. The SELinux system [27] includes
over 30 programs specifically designated as trusted to
enforce multilevel security (MLS) policies [14].

An important question is whether trusted programs ac- 
tually enforce the system’s security goals. Trusted pro- 
grams can be complex software, and they traditionally 
lack any declarative access control policy governing their 
behavior. Of the trusted programs in SELinux, only the 
X server currently has an access control policy. Even in 
this case, the system makes no effort to verify that the X 
server policy corresponds to the system’s policy in any 
way. Historically, only formal assurance has been used 
to verify that a trusted program enforces system secu- 
rity goals, but current assurance methodologies are time- 
consuming and manual. As a result, trusted programs 
are given their additional privileges without any concrete 
justification. 

Recently, the emergence of techniques for building 
programs with declarative access control policies moti- 
vates us to develop an automated mechanism to verify 
that such programs correctly enforce security goals. Pro- 
grams written in security-typed languages [23, 26, 28] or 
integrated with user-level policy servers [34] each in- 
clude program-specific access control policies. In the 
former case, the successful compilation of the program 
proves its enforcement of an associated policy. In the lat- 
ter case, the instrumentation of the program with a pol- 
icy enforcement aims to ensure comprehensive enforce- 
ment of mandatory access control policies. In general, 
we would want such programs to enforce system secu- 
rity goals, in which case we say that the program com- 
plies with the system’s security goals. 

We use the classical reference monitor concept [1] as
the basis for the program’s compliance requirements¹:
(1) the program policy must enforce a policy that
represents the system security goals and (2) the system
policy must ensure that the program cannot be tampered with.
We will show that both of these problems can be cast as
policy verification problems, but since program policies
and system policies are written in different environments,
often considering different security goals, they are not
directly comparable. For example, program policy
languages can differ from the system policy language. For
instance, the security-typed language Jif [26] uses an
information flow policy based on the Decentralized Label
Model [24], but the SELinux system policy uses an ex- 
tended Type Enforcement policy [5] that includes multi- 
level security labeling [2]. Even where program policies 
are written for SELinux-compatible policy servers [34], 
the set of program labels is often distinct from the set of 
system labels. In prior work, verifiably-compliant pro- 
grams were developed by manually joining a system pol- 
icy with the program’s policy and providing a mapping 
between the two [13]. To enable general programs to be 
compliant, our goal is to develop an approach by which 
compliant policy designs can be generated and verified 
automatically. 

As a basis for an automated approach, we observe that 
a trusted program and the system data upon which it operates
have distinct security requirements. For a trusted
program, we must ensure that the program’s components, 
such as its executable files, libraries, configuration, etc., 
are protected from tampering by untrusted programs. For 
the system data, the system security policy should ensure 
that all operations on that data satisfy the system’s se- 
curity goals. Since trusted programs should enforce the 
system’s security goals, their integrity must dominate the 
system data’s integrity. If the integrity of a trusted pro- 
gram is compromised, then all system data is at risk. Us- 
ing the insight that program integrity dominates system 
integrity, we propose the PIDSI approach to designing 
program policies, where we assign trusted program ob- 
jects to a higher integrity label than system data objects, 
resulting in a simplified program policy that enables au- 
tomated compliance verification. Our experimental re- 
sults justify that this assumption is consistent with the 
SELinux reference policy for its trusted programs. As 
a result, we are optimistic that trusted program policies 
can be designed independently of their target systems, 
yet still be deployed in a manner that ensures enforce- 
ment of system security goals. 

After providing background and motivation for the 
policy compliance problem in Section 2, we detail the 
following novel contributions: 


1. In Section 3, we define a formal model for the policy
compliance problem.

2. In Section 4, we propose an approach called Program
Integrity Dominates System Integrity (PIDSI),
where trusted programs are assigned to higher integrity
labels than system data. We show that compliant
program policies can be composed by relating
the program policy labels to the system policy
on the target system using the PIDSI approach.

3. In Section 5.1, we describe policy compliance tools
that automate the proposed PIDSI approach such
that a trusted program can be deployed on an existing
SELinux system and we can verify enforcement
of system security goals.

4. In Section 5.2, we show that the trusted programs for
which there are Linux packages in SELinux are
compatible with the PIDSI approach, with a few exceptions.
We show how these can be resolved using
a few types of simple policy modifications.


This work is the first that we are aware of that enables 
program and system security goals to be reconciled in a 
scalable (automated and system-independent) manner. 


2 Background 


The general problem is to develop an approach for build- 
ing and deploying trusted programs, including their ac- 
cess control policies, in a manner that enables automated 
policy compliance verification. In this section, we specify
the current mechanisms for these three steps: (1) trusted 
program policy construction; (2) trusted program deploy- 
ment; and (3) trusted program enforcement. We will use 
the SELinux system as the platform for deploying trusted 
programs. 


2.1 Program Policy Construction 


There are two major approaches for constructing pro- 
grams that enforce a declarative access control policy: 
(1) security-typed languages [26, 28, 33] (STLs) and (2) 
application reference monitors [22, 34]. These two ap- 
proaches are quite different, but we aim to verify policy 
compliance for programs implemented either way. 
Programs written in an STL will compile only if their 
information flows, determined by type inferencing, are 
authorized by the program’s access control policy. As 
a result, the STL compilation guarantees, modulo bugs 
in the program interpreter, that the program enforces the 
specified policy. As an example, we consider the Jif STL. 
A Jif program consists of the program code plus a pro- 
gram policy file [12] describing a Decentralized Label 
Model [24] policy for the program. The Jif compiler en- 
sures that the policy is enforced by the generated pro- 
gram. We would use the policy file to determine whether 
a Jif program complies with the system security goals.
For programs constructed with application reference 
monitors, the program includes a reference monitor in- 
terface [1] which determines the authorization queries 
that must be satisfied to access program operations. The 
queries are submitted to a reference monitor component 
that may be internal or external to the program. The use 
of a reference monitor does not guarantee that the pro- 
gram policy is correctly enforced, but a manual or semi- 
automated evaluation of the reference monitor interface 
is usually performed [17]. As an example, we consider 
the SELinux Policy Server [34]. A program that uses the
SELinux Policy Server loads a policy package containing
its policy into the SELinux Policy Server. The program
is implemented with its own reference monitor interface,
which submits authorization queries to the Policy
Server. We note that the programs that use an SELinux 
Policy Server may share labels, such as the labels of the 
system data, with other programs. 

As an example, we previously reimplemented one of 
the trusted programs in an SELinux/MLS distribution, 
logrotate, using the Jif STL [13]. logrotate 
ages logs by writing them to new files and is trusted 
in SELinux/MLS because it can read and write logs 
of multiple MLS secrecy levels. Our experience from 
logrotate is that ensuring system security goals re- 
quires the trusted program to be aware of the system’s 
label for its data. For example, if logrotate accesses 
a log file, it should control access to the file data based 
on the system (e.g., SELinux) label of that file. We man- 
ually designed the logrotate information flow pol- 
icy to use the SELinux labels and the information flows 
that they imply. Further, since logrotate variables 
may also originate from program-specific data, such as 
configurations, in addition to system files, the informa- 
tion flow policy had to ensure that the information flows 
among system data and program data were also correct.
As a result, the information flow policy required a man- 
ual merge of program and system information flow re- 
quirements. 


2.2 Program Deployment 


We must also consider how trusted programs are de- 
ployed on systems to determine what it takes to verify 
compliance. In Linux, programs are delivered in pack- 
ages. A package is a set of files including the executable, 
libraries, configuration files, etc. A package provides 
new files that are specific to a program, but a program 
may also depend on files already installed in the system
(e.g., system shared libraries, such as libc). Some
packages may also export files that other packages de- 
pend on (e.g., special libraries and infrastructure files 
used by multiple programs). 

For a trusted program, such as logrotate, we ex- 
pect that a Linux package would include two additional, 
noteworthy files: (1) the program policy and (2) the 
SELinux policy module². The program policy is the file
that contains the declarative access control policy to be 
enforced by the program’s reference monitor or STL im- 
plementation, as described above. 

In SELinux, the system policy is now composed from 
the policy modules. SELinux policy modules specify the 
contribution of the package to the overall SELinux sys- 
tem policy [20]. While SELinux policy modules are spe- 
cific to programs, they are currently designed by expert 
system administrators. Our logrotate program policy
is derived from the program’s SELinux policy module,
and we envision that program policies and system
policy modules will be designed in a coordinated way 
(e.g., by program developers rather than system admin- 
istrators) in the future, although this is an open issue. 

An SELinux policy module consists of three components
that originate from three policy source files. First,
a .te file defines a set of new SELinux types³ for this
package. It also defines the policy rules that govern program
accesses to its own resources as well as system resources.
Second, a .fc file specifies the assignment of
package files to SELinux types. Some files may use types
that are local to the policy module, but others may be
assigned types defined previously (e.g., system types like
etc_t, which is used for files in /etc). Third, a .if file defines a set
of interfaces that specify how other modules can access
objects labeled with the types defined by this module.
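
As a rough illustration of the file-to-type assignment that a .fc file expresses, the following Python sketch reads .fc-style lines and collects a mapping from path patterns to SELinux types. The helper name parse_fc, the regular expression, and the sample entries are assumptions made for this example; the entries are illustrative and are not copied from any shipped policy module.

```python
import re

# Matches lines of the general form:
#   /usr/sbin/logrotate  --  gen_context(system_u:object_r:logrotate_exec_t,s0)
# The optional second field (e.g. "--") is the file-type specifier.
FC_LINE = re.compile(
    r"^(?P<path>\S+)\s+(?:(?P<ftype>-\S)\s+)?"
    r"gen_context\(\s*[^:]+:[^:]+:(?P<type>[^,)\s]+)"
)

def parse_fc(text):
    """Map each path pattern in .fc-style text to its SELinux type."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = FC_LINE.match(line)
        if match:
            mapping[match.group("path")] = match.group("type")
    return mapping

# Illustrative entries only (hypothetical sample, not a real policy file).
SAMPLE_FC = """
/usr/sbin/logrotate    --    gen_context(system_u:object_r:logrotate_exec_t,s0)
/var/lib/logrotate.status    --    gen_context(system_u:object_r:logrotate_var_lib_t,s0)
"""

labels = parse_fc(SAMPLE_FC)
```

A mapping of this kind is what later lets a checker ask which labels a package's files carry, including files that reuse existing system types.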

When a package is installed, its files are downloaded 
onto the system and labeled based on the specification 
in the .fc file or the default system specification. Then,
the trusted program’s module policy is integrated into the
SELinux system policy⁴, enabling the trusted program to
access system objects and other programs to access the 
trusted program’s files. There are two ways that another 
program can access this package’s files: (1) because a 
package file is labeled using an existing label or (2) an- 
other module is loaded that uses this module’s interface 
or types. As both are possible for trusted programs, we 
must be concerned that the SELinux system policy may 
permit an untrusted program to modify a trusted pro- 
gram’s package file. 

For example, the logrotate package includes 
files for its executable, configuration file, documen- 
tation, man pages, execution status, etc. Some 
of these files are assigned new SELinux types de- 
fined by the logrotate policy module, such as 
the executable (logrotate_exec_t) and its status 
file (logrotate_var_lib_t), whereas others are as- 
signed existing SELinux types, such as its configuration 
file (etc_t). The logrotate policy module uses sys- 
tem interfaces to obtain access to the system data (e.g., 
logs), but no other processes access logrotate inter- 
faces. As a result, logrotate is only vulnerable to 
tampering because some of the system-labeled files that 
it provides may be modified by untrusted processes. 

We are also concerned that a logrotate process 
may be tampered with by the system data that it uses (e.g.,
Biba read-down [4]). For example, logrotate may 
read logs that contain malicious data. We believe that 
systems and programs should provide mechanisms to 
protect themselves from the system data that they pro- 
cess. Some interesting approaches have been proposed to 
protect process integrity [19,30], so we consider this an 
orthogonal problem that we do not explore further here.

Figure 1: Deployment and Installation of a trusted package. First, we check two compliance goals: (1) the system
protects the application and (2) the application enforces system goals. Second, the package is installed: the policy
module is integrated into the system policy and application files are installed.


2.3 Program Enforcement 


To justify a system’s trust, any trusted program must en- 
force a policy that complies with the system’s security 
goals. The reference monitor concept [1] has been the 
guide for determining whether a system enforces its se- 
curity goals, and we leverage this concept in defining 
compliance. A reference monitor requires three guaran- 
tees to be achieved: (1) complete mediation must ensure 
that all security-sensitive operations are authorized; (2) 
a reference monitor must be tamperproof to enforce its 
policy correctly; and (3) a reference monitor must be 
simple enough to verify enforcement of security goals. 
While the reference monitor concept is most identified 
with operating system security, a trusted program must 
also satisfy these guarantees to ensure that a system’s se- 
curity goal is enforced. As a result, we define that a pro- 
gram enforces a system’s security goals if it satisfies the 
reference monitor guarantees in its deployment on that 
system. 

In prior work, we developed a verification method that 
partially fulfilled these requirements. We developed a 
service, called SIESTA, that compares program policies 
against SELinux system policies, and only executes pro- 
grams whose policies permit information flows autho- 
rized in the system policy [13]. This work considered 
two of the reference monitor guarantees. First, we used 
the SIESTA service to verify trust in the Jif STL imple- 
mentation of the logrotate program. Since the Jif 
compiler guarantees enforcement of the associated pro- 
gram policy, this version of logrotate provides com- 
plete mediation, modulo the Java Virtual Machine. Sec- 
ond, SIESTA performs a policy analysis to ensure that 
the program policy complies with system security goals 
(i.e., the SELinux MLS policy). Compliance was defined 
as requiring that the logrotate policy only authorized
an operation if the SELinux MLS policy also permitted
that operation. Thus, SIESTA is capable of verifying a
program’s enforcement of system security goals.

We find two limitations to this work. First, we had to 
construct the program access control policy relating sys- 
tem and program objects in an ad hoc manner. As the 
resultant program policy specified the union of the sys- 
tem and program policy requirements, it was much more 
complex than we envisioned. Not only is it difficult to 
design a compliant program access control policy, but 
that policy may only apply to a small number of target 
environments. As we discussed in Section 2.1, program 
policies should depend on system policies, particularly 
for trusted programs that we expect to enforce the sys- 
tem’s policy, making them non-portable unless we are 
careful. Second, this view of compliance does not pro- 
tect the trusted program from tampering. As described 
above, untrusted programs could obtain access to the 
trusted program’s files after the package is installed, if 
the integrated SELinux system policy authorizes it. For 
example, if an untrusted program has write access to the 
/etc directory where configuration files are installed, 
as we demonstrated was possible in Section 2.2, SIESTA 
will not detect that such changes are possible. 

In summary, Figure 1 shows that we aim to define an
approach that ensures the following requirements: 


• For any system deployment of a trusted program,
automatically construct a program policy that is
compliant with the system security goals, thus satisfying
the reference monitor guarantee of being simple
enough to verify.

• For any system deployment of a trusted program,
verify, in a mostly-automated way, that the system
policy does not permit tampering of the trusted program
by any untrusted program, thus satisfying the
reference monitor guarantee of being tamperproof.
The typical number of verification errors must be
small and there must be a set of manageable resolutions
to any such errors.

Figure 2: The two policy compliance problems: (1) verify
that the program policy complies with the system’s
information flow goals and (2) verify that the system policy,
including the program contribution (e.g., SELinux
policy module), enforces the tamperproofing goals of the
program.


In the remainder of the paper, we present a single ap- 
proach that solves both of these requirements. 


3 Policy Compliance 


Verification of these two trusted program requirements 
results in the same conceptual problem, which we call
a policy compliance problem. Figure 2 shows these two
problems. First, we must show that the program policy 
only authorizes operations permitted by the system’s se- 
curity goal policy. While we have shown a method by 
which such compliance can be tested previously [13, 14], 
the program policy was customized manually for the sys- 
tem. Second, we also find that the system policy must 
comply with the program’s tamperproof goals. That is, 
the system policy must not allow any operation that permits
tampering with the trusted program. As a result, we need
to derive the tamperproof goals from the program (e.g., 
from the SELinux policy module). 

In this section, we define the formal model for veri- 
fying policy compliance suitable for both the problems 
above. However, as can be seen from Figure 2, the 
challenge is to develop system security goals, program 
policies, and tamperproof goals in a mostly-automated 
fashion that will encourage successful compliance. The 
PIDSI approach in Section 4 provides such guidance. 


3.1 Policy Compliance Model 


We specify system-wide information-flow goals as a security
lattice $\mathcal{L}$. We assume that elements of $\mathcal{L}$ have both
an integrity and a confidentiality component: this is the
case for both MLS labels in SELinux [11] and labels
from the DLM [25]. Let Integrity($l$) and Conf($l$) be the
integrity and confidentiality projections of a label $l \in \mathcal{L}$,
respectively. Let the lattice $\mathcal{L}$ have both a top element,
$\top$, and a bottom element, $\bot$. We use high = Integrity($\bot$)
and low = Integrity($\top$) to denote high and low integrity
and write high $\sqsubseteq$ low to indicate that high integrity data
can flow to a low integrity security label, but not the reverse.

An information-flow graph is a directed graph
$G = (V, E)$ where $V$ is the set of vertices, each associated
with a label from a security lattice $\mathcal{L}$. We write $V(G)$ for
the vertices of $G$ and $E(G)$ for the edges of $G$, and for
$v \in V(G)$ we write Type($v$) for the label on the vertex
$v$. Both subjects (e.g., processes and users) and objects
(e.g., files and sockets) are assigned labels from the same
security lattice $\mathcal{L}$. The edges in $G$ describe the information
flows that a policy permits.

We now formally define the concept of compliance
between a graph $G$ and a security lattice $\mathcal{L}$. For
$u, v \in V(G)$, we write $u \leadsto v$ if there is a path from vertex
$u$ to vertex $v$ in the graph $G$. An information-flow graph $G$ is
compliant with a security lattice $\mathcal{L}$ if all paths through the
combined information-flow graph imply that there is a
flow in $\mathcal{L}$ between the types of the elements in the graph.





Definition 3.1 (Policy Compliance). An information-flow
graph $G$ is compliant with a security lattice $\mathcal{L}$ if,
for each $u, v \in V(G)$ such that $u \leadsto v$, Type($u$) $\sqsubseteq$
Type($v$) in the security lattice $\mathcal{L}$.





With respect to MAC policies, a positive result of the 
compliance test implies that the information-flow graph 
for a policy does not permit any operations that violate 
the information-flow goals as encoded in the lattice L. 
If G is the information flow graph of a trusted program 
together with the system policy, then a compliance test 
verifies that the trusted program only permits informa- 
tion flows allowed by the operating system, as we desire. 
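
A minimal sketch of the compliance test in Definition 3.1 follows, assuming the graph is given as a set of directed edges and the lattice ordering is supplied as a predicate. The function names, the two-point integrity lattice, and the vertex names are assumptions chosen for illustration; a real check would use the full MLS or DLM lattice.

```python
def reachable(edges, start):
    """Vertices reachable from `start` along one or more directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def is_compliant(vertices, edges, type_of, can_flow):
    """Check that every path u ~> v satisfies can_flow(Type(u), Type(v))."""
    for u in vertices:
        for v in reachable(edges, u):
            if not can_flow(type_of[u], type_of[v]):
                return False
    return True

# Toy two-point integrity lattice (an assumption for this sketch):
# "high" may flow to "low", but not the reverse.
def can_flow(a, b):
    return a == b or (a == "high" and b == "low")

vertices = {"prog", "log"}
type_of = {"prog": "high", "log": "low"}

downward_ok = is_compliant(vertices, {("prog", "log")}, type_of, can_flow)
upward_ok = is_compliant(vertices, {("log", "prog")}, type_of, can_flow)
```

Here the downward flow from the high-integrity vertex passes the test, while the reverse edge makes the graph non-compliant.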


3.2 Difficulty of Compliance Testing 


The main difficulty in compliance testing is in automati- 
cally constructing the program, system, and goal policies 
shown in Figure 2. Further, we prefer design construc- 
tions that will be likely to yield successful compliance. 
The two particularly difficult cases are the program 
policies (i.e., upper left in the figure) and the tamper- 
proof goal policy (i.e. lower right in the figure). The pro- 
gram policy and tamperproof goal policies require pro- 
gram requirements to be integrated with system require- 
ments, whereas the system policy and system security 
goals are largely (although not necessarily completely) 
independent of the program policy. For example, while 
the system policy must include information flows for the
program, the SELinux system includes policy modules
for logrotate and other trusted programs that can
be combined directly.

First, it is necessary for program policies (i.e., upper
left in the figure) to manage system objects, but often
program policy and system policy are written with disjoint
label sets. Thus, some mapping from program labels
to system labels is necessary to construct a system-aware
program policy before the information flow goals
encoded in $\mathcal{L}$ can be evaluated. Let $P$ be an information-flow
graph relating the program subjects and objects and
$S$ be an information-flow graph relating the system subjects
and objects. Let $P \oplus S$ be the policy that arises from
combining $P$ and $S$ to form one information-flow graph
through some sound combination operator $\oplus$; that is, if
there is a runtime flow in the policy $S$ where the program
$P$ has been deployed, then there is a flow in the information-flow
graph $P \oplus S$. Currently, there are no automatic
ways to combine such program and system graphs into
a system-aware program policy, meaning that $\oplus$ is implemented
in a manual fashion. A manual mapping was
used in previous work on compliance [13].

Second, the tamperproof goal policy (i.e., lower right
in the figure) derives from the program’s integrity requirements
for its objects. Historically, such requirements
are not explicitly specified, so it is unclear which
program labels imply high integrity and which files 
should be assigned those high integrity labels. With the 
use of packages and program policy modules, the pro- 
gram files and labels are identified, but we still lack in- 
formation about what defines tamperproofing for the pro- 
gram. Also, some program files may be created at in- 
stallation time, rather than provided in packages, so the 
integrity of these files needs to be determined as well. 
We need a way to derive tamperproof goals automati- 
cally from packages and policy modules. 


4 PIDSI Approach 


We propose the PIDSI approach (Program Integrity 
Dominates System Integrity), where the trusted pro- 
gram objects (i.e., package files and files labeled using 
the labels defined by the module policy) are labeled such 
that their integrity is relative to all system objects. The 
information flows between the system and the trusted 
program can then be inferred from this relationship. We 
have found that almost all trusted program objects are 
higher integrity than system objects (i.e., system data 
should not flow to trusted program objects). One excep- 
tion that we have found is that both trusted and untrusted 
programs are authorized to write to some log files. How- 
ever, a trusted program should not depend on the data in 
a log file. While general cases may eventually be identified
automatically as low integrity, at present we may
have a small number of cases where the integrity level
must be set manually.

Figure 3: The PIDSI approach relates program labels P
to system labels S, such that the program-defined objects
are higher integrity than the system data objects (assigned
to H), with some small number of low integrity
exceptions (assigned to L).

Our approach takes advantage of a distinction between 
the protection of the trusted program and protection of 
the data to which it is applied. Trusted program pack- 
ages contain the files necessary to execute the program, 
and the integrity of the program’s execution requires pro- 
tection of these files. On the other hand, the program is 
typically applied to data whose protection requirements 
are defined by the system. 


4.1 PIDSI Definition 


By using the PIDSI approach between a trusted program
and the system, we can deploy that trusted program on
different systems, ensuring compliance. Figure 3 demonstrates
this approach. First, the program defines its own
set of labels, which are designated as either high or low
integrity. When the program is deployed, the system la- 
bels are placed in between the program’s high and low 
integrity labels. This allows an easy check of whether a 
program is compliant with the system’s policy, regardless 
of the specific mappings from system inputs and outputs 
to program inputs and outputs. 
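
The placement of system labels between the program's high and low integrity labels can be pictured as a three-level ordering. The sketch below collapses all system labels into a single level purely for illustration; the level names and the flow_allowed helper are assumptions, not part of the approach as specified.

```python
# Integrity ranks under PIDSI: a smaller rank means higher integrity.
# Collapsing every system label into one "system" level is a
# simplification made for this sketch.
RANK = {"program_high": 0, "system": 1, "program_low": 2}

def flow_allowed(src, dst):
    """Data may flow only to an equal or lower integrity level."""
    return RANK[src] <= RANK[dst]

high_to_system = flow_allowed("program_high", "system")
system_to_high = flow_allowed("system", "program_high")
system_to_low = flow_allowed("system", "program_low")
low_to_high = flow_allowed("program_low", "program_high")
```

Under this ordering, high-integrity program data may reach system data and system data may reach low-integrity program objects, while any flow back toward the high-integrity program labels is rejected.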

In the event that the trusted program allows data at a 
low integrity label to flow to a high label, then this ap- 
proach can trick the system into trusting low integrity 
data. To eliminate this possibility, we automatically ver- 
ify that no such flows are present in the program policy. 

For confidentiality, we found that the data stored by 
most trusted programs was intended to be low secrecy. 
The only exception to this rule that we found in the 
trusted program core of SELinux was sshd; this pro- 
gram managed SSH keys at type sshd_key_t, which 
needed to be kept secret⁵. We note that if program data
is low secrecy as well as high integrity, the same information
flows result: system data may not flow to program
data, so no change is required to the PIDSI approach. Because
of this, we primarily evaluate the PIDSI approach
with respect to integrity.

In this context, the compliance problem requires
checking that the system’s policy, when added to the program,
does not allow any new illegal flows. We construct
the composed program policy $P'$ from $P$ and $S$.
To compose $P$ and $S$ into $P' = P \oplus S$, first split $P$
into subgraphs $H$ and $L$ as follows: if $u \in P$ is such that
Integrity(Type($u$)) = high, then $u \in H$, and if $u \in P$
is such that Integrity(Type($u$)) = low, then $u \in L$. $P'$
contains copies of $S$, $H$, and $L$, with edges from each vertex
in $H$ to each vertex in $S$, and edges from each vertex in
$S$ to $L$. The constructed system policy $P'$ corresponds to
the deployment of the program policy $P$ on the system
$S$.
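
The construction of $P'$ can be sketched in a few lines, assuming graphs are represented as vertex sets and sets of directed edge pairs. Two points are assumptions of this sketch rather than statements from the text: all original edges of $P$ and $S$ are kept in $P'$, and edges from $S$ to $L$ are added from each system vertex to each low-integrity vertex. The function name compose and the example vertices are hypothetical.

```python
def compose(p_vertices, p_edges, integrity, s_vertices, s_edges):
    """Build the vertices and edges of P' from P and S (PIDSI construction).

    integrity maps each program vertex to "high" or "low".
    """
    high = {v for v in p_vertices if integrity[v] == "high"}  # subgraph H
    low = {v for v in p_vertices if integrity[v] == "low"}    # subgraph L
    edges = set(p_edges) | set(s_edges)
    edges |= {(h, s) for h in high for s in s_vertices}       # H -> S
    edges |= {(s, l) for s in s_vertices for l in low}        # S -> L
    return high | low | set(s_vertices), edges

# Hypothetical program with one high and one low vertex.
p_vertices = {"exec", "plog"}
p_edges = {("exec", "plog")}
integrity = {"exec": "high", "plog": "low"}
s_vertices = {"s1", "s2"}
s_edges = {("s1", "s2")}

vertices, edges = compose(p_vertices, p_edges, integrity, s_vertices, s_edges)
```

No edge is ever added from a system vertex into $H$, which is what keeps system data from flowing into high-integrity program objects.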


Theorem 4.1. Assume for all $v \in V(P)$,
Conf(Type($v$)) = Conf($\bot$). Given test policy $P$
and target policy $S$, if for all $u \in H$, $v \in L$, there is no
edge $(v, u) \in P$, then the test policy $P$ is compliant with
the constructed system policy $P'$.


Given the construction, the only illegal flow that can
exist in $P'$ is from a vertex $v \in L$, which has a low integrity
label, to one of the vertices $u \in H$, which has a
high integrity label. The graph $S$ is compliant with $P'$
by definition, and the edges that we add between subgraphs
are from $H$ to $S$ and $S$ to $L$: these do not upgrade
integrity.
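
The precondition of Theorem 4.1 amounts to scanning the program policy for edges that run from a low-integrity vertex to a high-integrity one. The following sketch shows one way such a scan might look; the function name and the example vertices are assumptions for illustration.

```python
def low_to_high_edges(p_edges, integrity):
    """Return program-policy edges from a low-integrity vertex
    to a high-integrity vertex (the flows the theorem forbids)."""
    return {
        (src, dst)
        for src, dst in p_edges
        if integrity[src] == "low" and integrity[dst] == "high"
    }

# Hypothetical program policy graph.
integrity = {"exec": "high", "conf": "high", "plog": "low"}
good_edges = {("exec", "plog"), ("conf", "exec")}
bad_edges = good_edges | {("plog", "conf")}

good_result = low_to_high_edges(good_edges, integrity)
bad_result = low_to_high_edges(bad_edges, integrity)
```

An empty result means the precondition holds for that program policy; any returned edge identifies a flow that would let low-integrity data reach a high-integrity program object.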

We argue that the PIDSI approach is consistent with 
the view of information flows in the trusted programs of 
classical security models. For example, MLS guards are 
trusted to downgrade the secrecy of data securely. Since 
an MLS guard must not lower the integrity of any down- 
graded data, it is reasonable to assume that the integrity 
of an MLS guard must exceed the system data that it pro- 
cesses. In the Clark-Wilson integrity model [7], only 
trusted transformation procedures (TPs) are permitted 
to modify high integrity data. In this model, TPs must 
be certified to perform such high integrity modifications 
securely. Thus, they also correspond to our notion of 
trusted programs. We find that other trusted programs, 
such as assured pipelines [5], also have a similar rela- 
tionship to the data that they process. 


4.2 PIDSI in Practice


In this section, we describe how we use the PIDSI approach
to construct the two policy compliance problems
defined in Section 3 for SELinux trusted programs.
Our proposed mechanism for checking compliance
of a trusted program during system deployment
was presented in Figure 1: we now give the specifics of
how this procedure would work during an installation of
logrotate. Figure 4 shows how we construct both
problems for logrotate on an SELinux/MLS system.
For testing compliance against the system security goals,
we use the PIDSI approach to construct the logrotate
program policy and use the SELinux/MLS policy for the
system security goals. For testing compliance against
the tamperproof goals, we use the SELinux/MLS policy
that includes the logrotate policy module for the
system policy and we construct the tamperproof goal policy
from the logrotate package. We argue why these
constructions are satisfactory for deploying trusted programs,
using logrotate on SELinux/MLS as an example.

Figure 4: logrotate instantiation for the two policy
compliance problems: (1) the program policy is derived
using the PIDSI approach and the SELinux MLS policy
forms the system’s information flow goals and (2)
the system policy is combined with the logrotate
SELinux policy module and the tamperproofing goals
are derived from the logrotate Linux package.

For system security goal compliance, we must show 
that the program policy only permits information flows 
in the system security goal policy. We use the PIDSI 
approach to construct the program policy as described 
above. For the Jif version of logrotate, this en- 
tails collecting the types (labels) from its SELinux policy 
module, and composing a Jif policy lattice where the
Jif versions of these labels are higher integrity (and lower
secrecy) than the system labels. Rather than adding each 
system label to the program policy, we use a single la- 
bel as a template to represent all of the SELinux/MLS 
labels [13]. We use the SELinux/MLS policy for the 
security goal policy. This policy clearly represents the 
requirements of the system, and logrotate adds no 
additional system requirements. While some trusted pro- 
grams may embody additional requirements that the sys- 
tem must uphold (e.g., for individual users), this is not 
the case for logrotate. As a result, to verify compliance
we must show that there are no information flows in
the program policy from system labels to program labels,
a problem addressed by previous work [13].

For tamperproof goal compliance, we must show that 
the system policy only permits information flows that 
are authorized in the tamperproof goal policy. The sys- 
tem policy includes the logrotate policy module, as 
the combination defines the system information flows 
that impact the trusted program. The tamperproof pol- 
icy is generated from the logrotate package and its 
SELinux policy module. The logrotate package 
identifies the labels of files used in the logrotate program. 
In addition to these labels, any new labels defined by
the logrotate policy module, excepting process labels,
which are protected differently as described in Section
2.2, are also added to the tamperproof policy. The
idea is that these labels may not be modified by untrusted 
programs. That is, untrusted process labels may not have 
any kind of write permission to the logrotate labels. 
Unlike security goal compliance, the practicality of tamperproof
compliance is unclear. It may be that system policies
permit many subjects to modify program objects,
thus making it impossible to achieve such compliance. 
Also, it may be difficult to correctly derive tamperproof 
goal policies automatically. In Section 5, we show pre- 
cisely how we construct tamperproof policies and test 
compliance, and examine whether tamperproof compli- 
ance, as we have defined it here, is likely to be satisfied 
in practice. 
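
The tamperproof requirement described above, that untrusted process labels have no write permission to the protected labels, can be pictured as a filter over the system policy's allow rules. This is a sketch under stated assumptions: rules are encoded as (subject type, object type, permission) triples, the set of write-like permissions is an illustrative subset, and the subject types are placeholders.

```python
# Permissions treated as "write-like" here; an illustrative subset only.
WRITE_PERMS = {"write", "append", "unlink", "rename", "setattr"}

def tamper_violations(allow_rules, protected_types, trusted_subjects):
    """Return rules that let an untrusted subject modify a protected type.

    allow_rules: iterable of (subject_type, object_type, permission).
    """
    return sorted(
        rule for rule in allow_rules
        if rule[0] not in trusted_subjects
        and rule[1] in protected_types
        and rule[2] in WRITE_PERMS
    )

# Hypothetical rule set; "untrusted_t" is a placeholder subject type.
rules = [
    ("logrotate_t", "logrotate_var_lib_t", "write"),
    ("untrusted_t", "etc_t", "write"),
    ("untrusted_t", "logrotate_exec_t", "read"),
]
protected = {"logrotate_exec_t", "logrotate_var_lib_t", "etc_t"}
trusted = {"logrotate_t"}

violations = tamper_violations(rules, protected, trusted)
```

In this toy rule set the only violation is the untrusted write to etc_t, which mirrors the concern that configuration files labeled with an existing system type may be modifiable by untrusted processes.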


5 Verifying Compliance in SELinux 


In this section, we evaluate the PIDSI approach against 
actual trusted programs in the SELinux/MLS system. 
As we discussed in Section 4.2, we want to determine 
whether it is possible to automatically determine tam- 
perproof goal policies and whether systems are likely 
to comply with such policies. First, we define a 
method for generating tamperproof goal policies auto- 
matically and show how compliance is evaluated for the 
logrotate program. Then, we examine whether eight 
other SELinux trusted programs satisfy tamperproof
compliance as well. This group of programs was
selected because: (1) they are considered MLS-trusted 
in SELinux and (2) these programs have Linux packages 
and SELinux policy modules. Our evaluation finds that 
there are only 3 classes of exceptions that result from our 
compliance checking for all of these evaluated packages. 
We identify straightforward resolutions for each of these 
exceptions. As a result, we find that the PIDSI approach 
appears promising for trusted programs in practice. 


5.1 Tamperproof Compliance 


To show how tamperproof compliance can be checked, 
we develop a method in detail for the logrotate program
on a Linux 2.6 system with an SELinux/MLS strict
reference policy. To implement compliance checking
with the tamperproof goals, we construct representations 
of the system (SELinux/MLS) policy and the program’s 
tamperproof goal policy. Recall from Section 3 that all 
the information flows in the system policy must be au- 
thorized by the tamperproof goal policy for the policy to 
comply. 


5.1.1 Build the Tamperproof Goal Policy 


To build the tamperproof goal policy, we build an 
information-flow graph that relates the program labels to 
system labels according to the PIDSI approach. Building 
this graph consists of the following steps: 


1. Find the high integrity program labels. 

2. Identify the trusted system subjects. 

3. Add information flow edges between the program 
labels, trusted subject labels, and remaining (un- 
trusted) SELinux/MLS labels authorized by the 
PIDSI approach. 


Find the high integrity program labels. This step en- 
tails collecting all the labels associated with the pro- 
gram’s files, as these will all be high integrity per the 
PIDSI approach. These labels are a union of the package
file labels determined by the file contexts (.fc file
in the SELinux policy module and the system file context)
and the newly-defined labels in the policy module
itself. First, the logrotate package includes the files
indicated in Table 1. This table lists a set of
files, the label assigned to each, whether such label is 
a program label (i.e., defined by the program’s policy 
module) or a system label, and the result of the tamper- 
proof compliance check, described below. Second, some 
program files may be generated after the package is in- 
stalled. These will be assigned new labels defined in the 
program policy module. An example of a logrotate 
label that will be assigned to a file that is not included 
in the package is logrotate_lock_t. In Section 6,
we discuss other system files that a trusted program may 
depend upon. 
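
The label-collection step can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions, not the tooling used in the evaluation: the function names, the simplified one-context-per-line input, and the example context strings are all hypothetical, and the file names and types are taken from Table 1.

```python
import re

# Pulls the type out of a context such as "system_u:object_r:etc_t:s0".
CONTEXT_TYPE = re.compile(r"object_r:([A-Za-z0-9_]+)")

def labels_from_file_contexts(fc_text):
    """Collect the types named in file-context style lines."""
    labels = set()
    for line in fc_text.splitlines():
        line = line.split("#", 1)[0]          # ignore comments
        match = CONTEXT_TYPE.search(line)
        if match:
            labels.add(match.group(1))
    return labels

def high_integrity_labels(fc_text, module_types, process_types):
    """Union of package file labels and module-defined labels, minus process labels."""
    return (labels_from_file_contexts(fc_text) | set(module_types)) - set(process_types)

# Hypothetical inputs modelled on Table 1; the context strings are illustrative.
EXAMPLE_FC = """
/etc/logrotate.conf        system_u:object_r:etc_t:s0
/usr/sbin/logrotate        system_u:object_r:logrotate_exec_t:s0
/var/lib/logrotate.status  system_u:object_r:logrotate_var_lib_t:s0
"""
MODULE_TYPES = {"logrotate_t", "logrotate_exec_t", "logrotate_lock_t", "logrotate_var_lib_t"}
PROCESS_TYPES = {"logrotate_t"}

program_labels = high_integrity_labels(EXAMPLE_FC, MODULE_TYPES, PROCESS_TYPES)
```

Here the system label etc_t enters the set through the package's file contexts, while logrotate_lock_t enters only through the module's type declarations, matching the two sources described above.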


Identify trusted subjects. Trusted subjects are 
SELinux subjects that are entrusted with write permis- 
sions to trusted programs. Based on our experience 
in analyzing SELinux/MLS, we identify the following 
seven trusted subjects: dpkg_script_t, dpkg_t,
portage_t, rpm_script_t, rpm_t, sysadm_t, and
prelink_t. These labels represent package managers
and system administrators; package managers and 
system administrators must be authorized to modify 
trusted programs. These subjects are also trusted by 
programs other than logrotate. We would want to 
control what code is permitted to run as these labels, but 
that is outside the scope of our current controls.

File                                SELinux Type           Policy   Writers   Exceptions
/etc/logrotate.conf                 etc_t                  system   18        integrity
/etc/logrotate.d                    etc_t                  system   18        integrity
/usr/sbin/logrotate                 logrotate_exec_t       module   8         no
/usr/share/doc/logrotate/CHANGES    usr_t                  system   7         no
/usr/share/man/logrotate.gz         man_t                  system   8         no
/var/lib/logrotate.status           logrotate_var_lib_t    module   8         no

Table 1: logrotate Compliance Test Case and Results: there are two exceptions, but they originate from the same

Figure 5: Part of the tamperproof goal policy’s
information-flow graph for logrotate. Only trusted 
labels (dotted line circles) and the program labels them- 
selves are allowed to write to files with the program 
labels (solid line circles), which represent the high- 
integrity files according to the PIDSI approach. Not 
shown: edges from the trusted subjects to each of the 
program labels to the right.

Add information flow edges. This step involves
adding edges between vertices (labels) in the tamper- 
proof goal information-flow graph based on the PIDSI 
approach. The PIDSI approach allows program labels
to read and write each other, but the only SELinux/MLS
labels that may write (as well as read) program labels are
the trusted subjects. Other SELinux labels are restricted
to reading the program labels only. Figure 5
presents an example of a tamperproof goal policy’s 
information-flow graph. Notice that only the system 
trusted labels (dotted circles) are allowed to write to pro- 
gram labels (solid line circles). The application has high 
integrity requirements for etc_t; the graph therefore includes
edges that represent these requirements. The same
set of edges is also added for the other program labels
(presented to the right in the figure). 
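
The write side of the goal policy can be sketched in Python as a map from each program label to its authorized writers. This is only an illustration of the rule stated above: the function names and the dictionary representation are assumptions, and the trusted subjects are the seven listed earlier in this section.

```python
TRUSTED_SUBJECTS = frozenset({
    "dpkg_script_t", "dpkg_t", "portage_t", "rpm_script_t",
    "rpm_t", "sysadm_t", "prelink_t",
})

def authorized_writers(program_labels, trusted_subjects=TRUSTED_SUBJECTS):
    """Map each program label to the labels the goal policy lets write it."""
    allowed = frozenset(program_labels) | frozenset(trusted_subjects)
    return {label: allowed for label in program_labels}

def goal_permits_write(goal, writer, target):
    """True if the goal policy authorizes `writer` to write `target`.

    Targets outside the goal policy are not constrained by it.
    """
    return target not in goal or writer in goal[target]

# Example with two of the logrotate program labels.
goal = authorized_writers({"etc_t", "logrotate_exec_t"})
```

Read edges are left out of the sketch because the approach permits every label to read the program labels, so only writes can produce an exception.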


5.1.2 Build the System Policy 


The system policy is represented as an information-flow 
graph (see Section 3). Building this graph consists of the 
following steps: 


1. Create an information-flow graph that represents the
current SELinux/MLS policy.

2. Add logrotate program’s information flow ver- 
tices and edges based on its SELinux policy mod- 
ule. 

3. Remove edges where neither vertex is in the tam- 
perproof goal policy. 


Create an information flow graph. We convert the 
current SELinux/MLS policy into an information-flow 
graph. Each of the labels in the SELinux/MLS policies 
is converted to a vertex. Information-flow edges are 
created by identifying read-like and write-like permissions
[10, 29] for subject labels to object labels. The
following example illustrates the process we follow to
create a small part of the graph. Rules 1-3 and 6 are
system rules; rules 4-5 are module rules (defined in the
logrotate policy module).


1. allow init_t init_var_run_t:file
{create getattr read append write
setattr unlink};

2. allow init_t bin_t:file
{{read getattr lock execute ioctl}
execute_no_trans};

3. allow init_t etc_t:file
{read getattr lock ioctl};

4. allow logrotate_t etc_t:file
{read getattr lock ioctl};

5. allow logrotate_t bin_t:file
{{read getattr lock execute ioctl}
execute_no_trans};

6. allow chfn_t etc_t:file
{create ioctl read getattr write
setattr append link unlink rename};


Figure 6 shows the result of the parsing of the previ- 
ous rules. In this example, subjects with type init_t 
are allowed to read from and write to init_var_run_t 
and logrotate_t is allowed to read from etc_t and
bin_t.

We note that Figure 6 shows that chfn_t has write 
access to etc_t which logrotate_t can read. While 
logrotate cannot write any file with the label etc_t,
it provides such a file via its package installation, so it
depends on the integrity of files with that label. This will be
identified as a tamperproof compliance exception below.

Figure 6: Information-flow graph for the system policy,
including the logrotate program’s policy module.
chfn_t is not trusted to modify other trusted programs,
but it has write access to logrotate’s files labeled
etc_t.

We are able to parse the text version of an SELinux 
policy (file policy.conf) with a C program inte- 
grated with Flex and Bison. We are also able to analyze 
the binary version of the SELinux system policy. 
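
The conversion from allow rules to information-flow edges can be sketched in Python. This is not the C/Flex/Bison parser described above: the regular expression handles only simple allow rules like those in the example, and the READ_LIKE and WRITE_LIKE sets are an assumed, partial classification of permissions chosen for illustration rather than the mapping defined in [10, 29].

```python
import re

# Assumed classification of file permissions, for illustration only.
READ_LIKE = frozenset({"read", "getattr", "execute", "execute_no_trans", "ioctl", "lock"})
WRITE_LIKE = frozenset({"write", "append", "create", "setattr", "link", "unlink", "rename"})

# allow <subject> <object>:<class> <permissions>;
ALLOW_RULE = re.compile(r"allow\s+(\w+)\s+(\w+)\s*:\s*(\w+)\s+([^;]+);")

def flows_from_rules(policy_text):
    """Return (reads, writes), each a set of (subject, object) label pairs."""
    reads, writes = set(), set()
    for subject, obj, _cls, perm_text in ALLOW_RULE.findall(policy_text):
        # Braces only group permissions, so they can be flattened away.
        perms = set(perm_text.replace("{", " ").replace("}", " ").split())
        if perms & READ_LIKE:
            reads.add((subject, obj))
        if perms & WRITE_LIKE:
            writes.add((subject, obj))
    return reads, writes

# Rules 1, 4 and 6 from the example above.
EXAMPLE_POLICY = """
allow init_t init_var_run_t:file
    {create getattr read append write setattr unlink};
allow logrotate_t etc_t:file
    {read getattr lock ioctl};
allow chfn_t etc_t:file
    {create ioctl read getattr write setattr append link unlink rename};
"""

reads, writes = flows_from_rules(EXAMPLE_POLICY)
```

On these three rules the sketch yields a read edge and a write edge between init_t and init_var_run_t, a read-only edge from logrotate_t to etc_t, and a write edge from chfn_t to etc_t, which is the edge that later surfaces as an exception.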


Add logrotate program’s information flows. In 
a similar fashion to the method above, we extend the 
information flow graph with the vertices (labels) and 
edges (read and write flows) from the logrotate pol- 
icy module. 


Remove edges where neither vertex is in the tamperproof
goal policy. As these flows cannot tamper with the
logrotate program, we remove these edges from the 
system policy for compliance testing. 
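
A minimal Python sketch of this pruning step follows, assuming edges are stored as (source, destination) label pairs; the function name and the example edges are hypothetical.

```python
def prune_edges(edges, goal_labels):
    """Keep only edges with at least one endpoint in the tamperproof goal policy."""
    goal_labels = set(goal_labels)
    return {(src, dst) for (src, dst) in edges
            if src in goal_labels or dst in goal_labels}

edges = {
    ("chfn_t", "etc_t"),
    ("init_t", "init_var_run_t"),
    ("logrotate_t", "etc_t"),
}
pruned = prune_edges(edges, {"etc_t", "logrotate_exec_t"})
```

The edge between init_t and init_var_run_t is dropped because neither label appears in the goal policy.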


5.1.3 Evaluating logrotate 


This section presents how we automatically test tamperproof
compliance. Tamperproof compliance is based on
checking the system policy for information flow integrity
as defined by the tamperproof goal policy. 

Integrity Compliance Checking. To detect integrity vi- 
olations, we identify information flows that violate the 
Biba integrity requirement [4]: an information flow from 
a low integrity label (type in SELinux) to a high integrity
label. The arguments of read and write are, in order, the
subject and the object.


$$
\mathit{NonBibaFlows}_{\mathit{SELinux}}(\mathit{Policy}) =
\{(t_1, t_2) : t_1, t_2 \in \mathit{types}(\mathit{Policy}).\ \mathit{highintegrity}(t_1) \wedge
\mathit{lowintegrity}(t_2) \wedge (\mathit{read}(t_1, t_2) \vee \mathit{write}(t_2, t_1))\}
$$

We use the XSB Prolog engine [32] as the underlying
platform. We developed a set of Prolog queries based on
the NonBibaFlows rule to detect the labels that affect
compliance (i.e., the high integrity requirements that are
not enforced by the system policy).

As mentioned in the previous section, we evaluate 
tamperproof compliance at installation time. Each time,
we load the policy graphs generated above into the Prolog
engine and run the integrity queries to determine
whether any flows match the NonBibaFlows rule,
thus violating compliance.
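
For illustration, the NonBibaFlows check can be written as a set comprehension. The sketch below uses Python as a stand-in for the Prolog queries; it is not the authors' implementation, and the function name and the example label sets are assumptions. As in the rule, reads and writes hold (subject, object) pairs.

```python
def non_biba_flows(high, low, reads, writes):
    """Pairs (t1, t2), t1 high and t2 low, where t1 reads t2 or t2 writes t1."""
    return {
        (t1, t2)
        for t1 in high
        for t2 in low
        if (t1, t2) in reads or (t2, t1) in writes
    }

# Hypothetical example: chfn_t is untrusted and can write etc_t.
high = {"etc_t", "logrotate_exec_t"}
low = {"chfn_t", "init_t"}
reads = {("logrotate_t", "etc_t"), ("init_t", "etc_t")}
writes = {("chfn_t", "etc_t"), ("rpm_t", "logrotate_exec_t")}

violations = non_biba_flows(high, low, reads, writes)
```

The write by rpm_t is not reported because rpm_t is a trusted subject and therefore is not in the low integrity set; init_t only reads etc_t, which the goal policy permits.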


Results. Table 1 presents the results for compliance 
checking logrotate against the generated tamperproof
goal policy (see column 4). Only etc_t has unauthorized
writers. In the SELinux/MLS reference policy,
these writers are programs with legitimate reasons
to write to files in the /etc directory, but none have legitimate
reasons to write to logrotate files. For example,
chfn, groupadd, passwd, and useradd
are programs that modify system files that store user
information in /etc; kudzu is a program that detects
and configures new and/or changed hardware in
a system and needs to update its database stored in
/etc/sysconfig/hwconf; and updfstab is designed
to keep /etc/fstab consistent with the devices
plugged into the system.

The obvious solution would be to refine the labels for 
files in /etc to eliminate these kinds of unnecessary and 
potentially-risky operations. 


5.2 Evaluating other Trusted Programs 


Table 2 shows a summary of the results from applying 
the PIDSI approach to eight SELinux trusted programs 
for which policy modules and packages are defined. The 
table shows: (1) trusted package, (2) file labels (SELinux 
types) used per package, (3) number of writers detected 
per type (Writers) and (4) exceptions. The integrity requirement
assigned by default is high integrity for all
types, except for the ones marked with **: because the
semantics of /var lead various applications to write
to this directory, we assign a low integrity requirement to
var_log_t and var_run_t.

The common system types (bin_t, etc_t,
lib_t, man_t, sbin_t and usr_t) are marked
with † in the last two columns. The results for these
types are displayed in Table 3. The results show only
two exceptions, none in Table 2 and two in Table 3.

The reasons behind and resolutions for these exceptions
are shown in Table 4. One good resolution would
be a refinement of the policies: programs should have 
particular labels for their files, even if they are installed 
in system directories, instead of using general system labels.
The use of a general system label gives all system
programs access to these files (case APP LABELS in Table 4).
However, this option is not always possible, as
sometimes a program actually requires access to system
files. In such cases, the programs have to be trusted (case
ADD in Table 4). For example, some trusted programs
read information from the /etc/passwd file, so those
subjects permitted to modify that file must be trusted.
Only a small number of such programs must be trusted.

Package          SELinux Label               Writers   Exception
                 initrc_exec_t               8         no
                 textrel_shlib_t             9         no
                 lpr_exec_t                  8         no
cups             dbusd_etc_t                 7         no
                 system types                †         †
                 var_log_t**                 14        no
                 var_run_t**                 10        no
                 var_spool_t                 10        no
dmidecode        dmidecode_exec_t            8         no
                 system types                †         †
                 locale_t                    7         no
                 initrc_exec_t               8         no
                 hald_exec_t                 8         no
hal              dbusd_etc_t                 7         no
                 system types                †         †
                 iptables_exec_t             8         no
iptables         initrc_exec_t               8         no
                 system types                †         †
                 locale_t                    7         no
kudzu            initrc_exec_t               8         no
                 system types                †         †
                 initrc_exec_t               8         no
NetworkManager   NetworkManager_var_run_t    8         no
                 NetworkManager_exec_t       8         no
                 dbusd_etc_t                 7         no
                 system types                †         †
                 rpm_exec_t                  8         no
rpm              rpm_var_lib_t               7         no
                 system types                †         †
                 var_spool_t                 10        no
                 sshd_exec_t                 8         no
sshd             sshd_var_run_t              8         no
                 ssh_keygen_exec_t           8         no
                 ssh_keysign_exec_t          8         no

Table 2: Results of applying the PIDSI approach to
SELinux Trusted Packages. Rows marked with † are
displayed in Table 3.


6 Discussion 


Trusted programs may use system files, such as system
libraries or the password file, in addition to the files provided
in their packages. Because some of our trusted program
packages installed their own libraries under the system
label lib_t, our analysis included system libraries.
Therefore, application integrity not only depends on the
integrity of the files in the installation package but also
on some other files. In general, the files that the program
execution depends on should be comprehensively identified.
These should be well-known per system.

SELinux Label   Writers   Exceptions
bin_t           9         no
etc_t           18        integrity
lib_t           8         no
man_t           8         integrity
sbin_t          8         no
usr_t           7         no

Table 3: System labels referenced by the packages presented
in Table 2. Only etc_t and man_t have conflicts;
the number of conflicting types per case cannot be high
(the Writers column is an upper limit since it includes trusted
writers), so we can precisely examine each exception and
suggest resolutions (shown in Table 4).

An issue is whether a trusted program may create a file 
whose integrity it depends upon that has a system label. 
For example, a trusted program generates the password 
file, but this file is used by the system, so it has a system label.
We did not see a case where this happened for our trusted 
programs, but we believe that this is possible in practice. 
We believe that more information about the integrity of 
the contents generated by the program will need to be 
used in compliance testing. For example, if the program 
generates data it marks as high integrity, then we could 
leverage this in addition to package files and program 
policy labels to generate tamperproof goal policies. 

An issue with our approach is the handling of low in- 
tegrity program objects. Since low integrity program ob- 
jects are the lowest integrity objects in the system, any 
program can write to these objects. We find that we 
want low integrity program objects to be relative to the 
trusted programs; lower than all trusted programs, but 
still higher than system data. Further investigation is re- 
quired. 

The approach in this paper applies only to trusted pro- 
grams. We make no assumptions about the relationship 
between untrusted programs and the system data. In fact,
we are certain that there is system data that should not 
be accessed by most, if not all, untrusted programs. Note 
that there is no advantage to verifying the compliance of 
untrusted programs, because the system does not depend
on untrusted programs to enforce its security goals. Such 
programs have no special authority. 


7 Related Work 


Policy Analysis. Policies generally contain a considerable
number of rules that express how elements in a given
environment must be controlled. Because of the size of
a policy and the relationships that emerge from having
a large number of rules, it is difficult to manually evaluate
whether a policy satisfies a given property or not.
As a consequence, tools to automatically analyze policy
are necessary. APOL [35], PAL [29], SLAT [10],
Gokyo [16] and PALMS [15] are some of the tools developed
to analyze SELinux policies; however, each of
these tools focuses on the analysis of single security policies.
Of these, only PALMS offers mechanisms to compare
policies; in particular it addresses compliance evaluation,
but our approach to compliance is broader and
allows the compliance problem to be automated.

SELinux   Conflicting Labels       Type of     Resolution   Comment
Label                              Exception   Method
etc_t     groupadd_t, passwd_t,    Integrity   ADD          The conflicting labels require access to
          useradd_t, chfn_t                                 the same file, /etc/passwd
etc_t     updfstab_t,              Integrity   ADD          The first two labels have legitimate reasons
          ricci_modstorage_t,                               to modify /etc/fstab. The last
          firstboot_t                                       type modifies multiple files in /etc
etc_t     postgresql_t, kudzu_t    Integrity   APP LABELS   The conflicting types need access to application
                                                            files labeled with system labels
man_t     system_crond_t           Integrity   REMOVE       crond does not need to write manual pages

ADD: Add conflicting types to the set of trusted readers (confidentiality) or writers (integrity).
APP LABELS: The associated application requires access to a file that is application specific but was labeled using system
labels. Adding application specific labels to handle those files solves the conflict.
REMOVE: The permission requested is not required.

Table 4: Compliance Exceptions and Resolutions. This table details the exceptions to tamperproof compliance
presented in Table 3. It shows the list of conflicting, untrusted subjects and the resolution method, per case.

Policy Modeling. We need a formal model to reason 
about the features of a given policy. Such a model should 
be largely independent of the particular representation of the
targeted policies and should enable comparisons among 
different policies. Multiple models have been proposed 
and each one of them defines a set of components that 
need to be considered when translating a policy to an 
intermediate representation. Cholvy and Cuppens [6] fo- 
cus on permissions, obligations, prohibitions and provide 
a mechanism to check regulation consistency. Bertino et 
al. [3] focus on subjects, objects and privileges, as well 
as the organization of these components and the set of 
authorization rules that define the relationships among 
components and the set of derived rules that may be generated
because of a hierarchical organization. Koch et
al. [18] represent policies as graphs with nodes that represent
components (processes, users, objects), edges
that represent rules, and a set of constraints that are globally
applied to the system. In any case, policy modeling becomes
a building block in the process of evaluating compliance.
Different policies must be translated to an intermediate
representation (a common model) so they can be
compared and their properties evaluated.


Policy Reconciliation. Policy compliance problems
may resemble policy reconciliation problems. Given two
policies A and B that define a set of requirements, a reconciliation
algorithm looks for a specific policy instance
C that satisfies the stated requirements. Policy compliance
in a general sense, i.e., ‘Given a policy A and a policy
B, is B compliant with A?’, means ‘is any part of A
in conflict with B?’. Previous work [21] shows that reconciliation
of three or more policies is intractable. Compliance
is also an intractable problem, since it would require
checking all possible paths in B against all possible
paths in A. Although these problems are similar
in that they both test policy properties and are intractable
in the general case (no restrictions), they differ in
their inputs and expected outputs. In the case of
reconciliation, an instance that satisfies the requirements
has to be calculated; in the case of compliance, policy
instances are given and one is evaluated against the
other.


Policy Compliance. The security-by-contract 
paradigm resembles our policy compliance model. It is 
one of the mechanisms proposed to support installation 
and execution of potentially malicious code from a third 
party in a local platform. Third party applications are 
expected to come with a security contract that specifies 
application behavior regarding security issues. The first 
step in the verification process is checking whether the 
behaviors allowed by the contract are also allowed by 
the local policy [8]. In the most recent project involving 
contract matching, contract and policy are security automata
and the problem of contract matching becomes
a problem of testing language inclusion for automata.
While there is no known polynomial technique to test
language inclusion for non-deterministic automata,
determining language inclusion for deterministic automata
is known to be polynomial [9]. One main
advantage of our representation is that we are verifying 
policies that are actually implemented by the enforcing
mechanism, not high-level statements that may not
be actually implemented because of the semantic gap 
between specification and implementation. In addition, 
the enforcing mechanism is part of the architecture. 


8 Conclusion 


This work is driven by the idea of unifying application 
and system security policies. Since application and system
policies are independently developed, they use different
language syntax and semantics. As a consequence,
it is difficult to prove or disprove that programs enforce 
system security goals. The emergence of mandatory ac- 
cess control systems and security typed languages makes 
it possible to automatically evaluate whether applications 
and systems enforce common security goals. We reshape 
this problem as a verification problem: we want to eval- 
uate if applications are compliant with system policies. 

We found that compliance verification involves two 
tasks: we must ensure that the system protects the application
from being tampered with, as well as verify that
the application enforces system security goals. In or- 
der to automate the mapping between the program pol- 
icy and the system policy, we proposed the PIDSI (Pro- 
gram Integrity Dominates System Data Integrity) ap- 
proach. The PIDSI approach relies on the observation 
that in general program objects are higher integrity than 
system objects. We tested the trusted program core of the 
SELinux system to see if its policy was compatible with 
the PIDSI approach. We found that our approach accu- 
rately represents the SELinux security design with a few 
minor exceptions, and requires little or no feedback from 
administrators in order to work. 


Notes 


1 The program verification (e.g., STL compilation) enforces the
complete mediation guarantee.

2 At present, module policies are not included in Linux packages,
but RedHat, in particular, is interested in including SELinux module
policies in its rpm packages in the future [36].

3 SELinux uses the term type for its labels, as it uses an extended
Type Enforcement policy [5].

4 As described above, this must be done manually now, via
semodule, but the intent is that when you load a package containing
a module policy, someone will install the module policy.

5 In this case, violating the confidentiality of SSH keys enables a
large class of integrity attacks. This phenomenon has been discussed
more generally by Sean Smith [31].


Abstract


Voting with cryptographic auditing, sometimes called 
open-audit voting, has remained, for the most part, a the- 
oretical endeavor. In spite of dozens of fascinating pro- 
tocols and recent ground-breaking advances in the field, 
there exist only a handful of specialized implementations 
that few people have experienced directly. As a result, 
the benefits of cryptographically audited elections have 
remained elusive. 

We present Helios, the first web-based, open-audit 
voting system. Helios is publicly accessible today: any- 
one can create and run an election, and any willing ob- 
server can audit the entire process. Helios is ideal for on- 
line software communities, local clubs, student govern- 
ment, and other environments where trustworthy, secret- 
ballot elections are required but coercion is not a serious 
concern. With Helios, we hope to expose many to the 
power of open-audit elections. 


1 Introduction 


Over the last 25 years, cryptographers have developed 
election protocols that promise a radical paradigm shift: 
election results can be verified entirely by public ob- 
servers, all the while preserving voter secrecy. These 
protocols are said to provide two properties: ballot cast- 
ing assurance, where each voter gains personal assur- 
ance that their vote was correctly captured, and universal 
verifiability, where any observer can verify that all cap- 
tured votes were properly tallied. Some have used the 
term “open-audit elections” to indicate that anyone, even 
a public observer with no special role in the election, can 
act as auditor. 

Unfortunately, there is a significant public-awareness 
gap: few understand that these techniques represent a 
fundamental improvement in how elections can be au- 
dited. Even voting experts who recognize that open-audit 
elections are “the way we’ll all vote in the future” seem
to envision a distant future, not one we should consider
for practical purposes yet. The few implementations of 
open-audit elections that do exist [3, 2] have not had as 
much of an impact as hoped, in large part because they 
require special equipment and an in-person experience, 
thus limiting their reach. 


We present Helios, a web-based open-audit voting 
system. Using a modern web browser, anyone can set 
up an election, invite voters to cast a secret ballot, com- 
pute a tally, and generate a validity proof for the entire 
process. Helios is deliberately simpler than most com- 
plete cryptographic voting protocols in order to focus on 
the central property of public auditability: any group can 
outsource its election to Helios, yet, even if Helios is 
fully corrupt, the integrity of the election can be verified. 


Low-Coercion Elections. Voting online or by mail is 
typically insecure in high-stakes elections because of the 
coercion risk: a voter can be unduly influenced by an at- 
tacker looking over her shoulder. Some protocols [13] 
attempt to reduce the risk of coercion by letting voters 
override their coerced vote at a later (or earlier) time. In 
these schemes, the privacy burden is shifted from vote 
casting to voter registration. In other words, no matter 
what, some truly private interaction is required for coer- 
cion resistance. 


With Helios, we do not attempt to solve the coercion 
problem. Rather, we posit that a number of settings— 
student government, local clubs, online groups such as 
open-source software communities, and others—do not 
suffer from nearly the same coercion risk as high-stakes 
government elections. Yet these groups still need voter 
secrecy and trustworthy election results, properties they 
cannot currently achieve short of an in-person, physically 
observable and well orchestrated election, which is often 
not a possibility. We produced Helios for exactly these 
groups with low-coercion elections.

Trust no one for integrity, trust Helios for privacy.
In cryptographic voting protocols, there is an inevitable 
compromise: unconditional integrity, or unconditional 
privacy. When every component is compromised, only 
one of those two properties can be preserved. In this 
work, we hold the opinion that the more important prop- 
erty, the one that gets people’s attention when they under- 
stand open-audit voting, is unconditional integrity: even 
if all election administrators are corrupt, they cannot con- 
vincingly fake a tally. With this design decision made, 
privacy is then ensured by recruiting enough trustees and 
hoping that a minimal subset of them will remain honest. 

In the spirit of simplicity, and because it is difficult 
to explain to users how privacy derives from the acts of 
multiple trustees, Helios takes an interesting approach: 
there is only one trustee, the Helios server itself. Pri- 
vacy is guaranteed only if you trust Helios. Integrity, of 
course, does not depend on trusting Helios: the election 
results can be fully audited even if all administrators
(in this case, the single Helios server) are corrupt. Future
versions of Helios may support multiple trustees. How- 
ever, exhibiting the power of universal verifiability can 
be achieved with this simpler setup. 


Our Contribution. In this work, we contribute the 
software design and an open-source, web-based imple- 
mentation of Helios, as well as a running web site that 
anyone can use to manage their elections at
http://heliosvoting.org. We do not claim any cryptographic
novelty. Rather, our contribution is a combination
of existing Web programming techniques and cryptographic
voting protocols to provide the first truly accessible
open-audit voting experience. We believe Helios
provides a unique opportunity to educate people about 
the value of cryptographic auditability. 


Limitations. While every major feature is functional, 
Helios is currently alpha software. As such, it requires 
Firefox 2 (or later). In addition, some aspects of the user 
interface, especially for administrative tasks, require sig- 
nificant additional polish and better user feedback on er- 
ror. These issues are being actively addressed. 


This Paper. In Section 2, we briefly review the Helios 
protocol, based on the Benaloh vote-casting approach [5] 
and the Sako-Kilian mixnet [16]. In Section 3, we cover 
some interesting techniques used to implement Helios in 
a modern Web browser. Section 4 covers the specifics of 
the Helios system and its use cases. We discuss, in Sec- 
tion 5, the security model, some performance metrics, 
and features under development. We reference related 
work in Section 6 and conclude in Section 7.


2 Helios Protocol


This section describes the Helios protocol, which is 
most closely related to Benaloh’s Simple Verifiable Vot- 
ing protocol [5], which itself is partially inspired by the 
Sako-Kilian mixnet [16]. We claim no novelty; we only
mean to be precise in the steps taken by voters, adminis- 
trators, and auditors, and we mean to provide enough de- 
tails for an able programmer to re-implement every por- 
tion of this protocol. 


2.1 Vote Preparation & Casting 


The key auditability feature proposed by Benaloh’s Sim- 
ple Verifiable Voting is the separation of ballot prepara- 
tion and casting. A ballot for an election can be viewed 
and filled in by anyone at any time, without authentica- 
tion. The voter is authenticated only at ballot casting 
time. This openness makes for increased auditability, 
since anyone, including an auditor not eligible to vote 
(e.g. someone from a political organization who has al- 
ready voted), can test the ballot preparation mechanism. 
The process is as follows between Alice, the voter, and 
the Ballot Preparation System (BPS): 


1. Alice begins the voting process by indicating in 
which election she wishes to participate. 


2. The BPS leads Alice through all ballot questions, 
recording her answers. 


3. Once Alice has confirmed her choices, the BPS en- 
crypts her choices and commits to this encryption 
by displaying a hash of the ciphertext. 


4. Alice can now choose to audit this ballot. The BPS 
displays the ciphertext and the randomness used to 
create it, so that Alice can verify that the BPS had 
correctly encrypted her choices. If this option is se- 
lected, the BPS then prompts Alice to generate a 
new encryption of her choices. 


5. Alternatively, Alice can choose to seal her ballot. 
The BPS discards all randomness and plaintext in- 
formation, leaving only the ciphertext, ready for 
casting. 


6. Alice is then prompted to authenticate. If success- 
ful, the encrypted vote, which the BPS committed 
to earlier, is recorded as Alice’s vote. 
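
The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name, method names, the toy group parameters, and the choice of SHA-1 over a comma-joined ciphertext are all hypothetical, and the voter's choice is assumed to be already encoded as a group element.

```python
import hashlib
import secrets

Q, P, G = 1019, 2039, 4   # toy El-Gamal group (illustrative only), p = 2q + 1


class BallotPreparationSystem:
    """Hypothetical BPS: encrypt and commit, then either audit or seal."""

    def __init__(self, public_key):
        self.y = public_key
        self.plaintext = self.randomness = self.ciphertext = None

    def encrypt(self, choice):
        """Step 3: encrypt the choice and return a hash of the ciphertext."""
        self.plaintext = choice
        self.randomness = secrets.randbelow(Q)
        self.ciphertext = (pow(G, self.randomness, P),
                           choice * pow(self.y, self.randomness, P) % P)
        data = "{},{}".format(*self.ciphertext).encode()
        return hashlib.sha1(data).hexdigest()

    def audit(self):
        """Step 4: reveal everything; this ciphertext can no longer be cast."""
        revealed = (self.plaintext, self.randomness, self.ciphertext)
        self.randomness = self.ciphertext = None   # force a fresh encryption
        return revealed

    def seal(self):
        """Step 5: discard plaintext and randomness, keep only the ciphertext."""
        self.plaintext = self.randomness = None
        return self.ciphertext
```

An auditor who receives the output of audit() can recompute the ciphertext from the plaintext and randomness and compare its hash with the commitment displayed earlier.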


Because we worry little about the possibility of co- 
ercion, Helios can be simpler than the Benaloh system 
that inspired it. Specifically, the BPS does not sign the 
ciphertext before casting, and we do not worry about Al- 
ice seeing the actual hash commitment of her encrypted 
vote before sealing. (The subtle reasons why these can 
lead to coercion are explained in [6].) 


Teaching Voters about Coercion. While online-only
voting is inherently coercible, few voters are aware of 
this issue: many US elections today are shifting to vote- 
by-mail without realizing the subtle but critical change 
in coercibility. We take the opportunity to make this is- 
sue clear with Helios by making coercion explicit: we 
provide a “Coerce Me!” button at ballot casting time 
which allows any voter to email a potential coercer the 
complete proof — ciphertext, randomness, and plaintext 
— of how they voted. This design choice does not enable 
new avenues for coercion; it only makes the existing co- 
ercibility more apparent. It is our hope that Helios can 
thus help educate voters about this critical election issue. 


2.2 Bulletin Board of Votes 


In all cryptographic voting protocols, a bulletin board 
is made publicly available. On this bulletin board, cast 
votes are displayed next to either a voter name or voter 
identification number. All subsequent data processing 
is also posted for the public to download and verify. A 
number of distributed bulletin board protocols, including 
consensus algorithms, have been proposed. 

In Helios, we forgo complexity and opt for the sim- 
plest possible bulletin board, run by a single server. We 
expect auditors to check the bulletin board’s integrity 
over time, and enough individual voters to check that 
their encrypted vote appears on the bulletin board. Once 
again, we opt for this simplification in order to focus the 
user on the major advantage of the system: Helios is au- 
ditable by anyone, including watchdog organizations and 
individual voters themselves. 


2.3 Sako-Kilian/Benaloh Mixnet 


In cryptographic voting protocols that wish to preserve 
individual ballots and potentially support write-in votes, 
anonymization is typically achieved by way of a mixnet, 
where trustees each shuffle and re-randomize the cast ci- 
phertexts before jointly decrypting them. Both the shuf- 
fling and decryption of encrypted ballots are accompa- 
nied by proofs of correctness. 

We use the Sako-Kilian protocol [16], the first prov- 
able mixnet based on El-Gamal re-encryption. We note 
that Benaloh uses a very similar technique [5]. We chose 
this scheme because of its simplicity and ease of expla- 
nation, even though we know of more complex proto- 
cols [15] that achieve an order of magnitude better per- 
formance for the same assurance of integrity. 


El Gamal. Recall the El-Gamal encryption scheme implemented
to support semantic security: a large (1024 bits) prime $p$
is selected, such that $p = 2q + 1$ with $q$ also prime. A
generator $g$ of the $q$-order subgroup of $\mathbb{Z}_p^*$
is selected. A secret key $x \in \mathbb{Z}_q$ is selected,
and the corresponding public key $y = g^x \bmod p$ is
computed. A message $m$ in the $q$-order subgroup of
$\mathbb{Z}_p^*$ is then encrypted by selecting
$r \in \mathbb{Z}_q$ and computing
$c = (\alpha, \beta) = (g^r, m y^r)$. Decryption is computed
as $m = \alpha^{-x} \beta$.

When $m \in \mathbb{Z}_q$, meaning that it is not necessarily
in the $q$-order subgroup of $\mathbb{Z}_p^*$, a simple
mapping from $\mathbb{Z}_q$ to the $q$-order subgroup of
$\mathbb{Z}_p^*$ is used: on input $m$, compute $m_0 = m + 1$
and, if $m_0^q \equiv 1 \bmod p$, output $m_0$, otherwise
output $-m_0 \bmod p$. Upon decryption, one obtains $m$, and
the reverse mapping is achieved as follows: if $m \le q$, set
$m_0 = m$, otherwise $m_0 = -m \bmod p$, and output $m_0 - 1$.
Using these techniques, we can efficiently encrypt and decrypt
messages in $\mathbb{Z}_q$ for $q$ a 512-bit prime. This is
the natural path for message encryption, as a typical
plaintext can be any string of bits up to a certain size.
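
The scheme and the subgroup mapping can be illustrated with a minimal Python sketch. The tiny parameters (q = 1019, p = 2039, g = 4) and all function names are assumptions chosen for readability; they are far too small for real use and are not taken from the Helios code.

```python
import secrets

# Toy parameters for illustration only: p = 2q + 1 with p and q both prime.
Q = 1019
P = 2 * Q + 1   # 2039
G = 4           # 4 = 2^2 is a quadratic residue, so it generates the q-order subgroup


def keygen():
    """Return (x, y): secret key x in Z_q and public key y = g^x mod p."""
    x = secrets.randbelow(Q - 1) + 1
    return x, pow(G, x, P)


def encode(m):
    """Map m in Z_q into the q-order subgroup of Z_p^*."""
    m0 = m + 1
    return m0 if pow(m0, Q, P) == 1 else P - m0


def decode(e):
    """Invert encode(): recover the original message in Z_q."""
    m0 = e if e <= Q else P - e
    return m0 - 1


def encrypt(y, m, r=None):
    """Encrypt m in Z_q under public key y; returns ((alpha, beta), r)."""
    if r is None:
        r = secrets.randbelow(Q)
    return (pow(G, r, P), encode(m) * pow(y, r, P) % P), r


def decrypt(x, ciphertext):
    """Compute alpha^(-x) * beta mod p, then undo the subgroup mapping."""
    alpha, beta = ciphertext
    # alpha has order dividing q, so alpha^(-x) equals alpha^(q - x).
    return decode(pow(alpha, Q - x, P) * beta % P)


x, y = keygen()
ciphertext, r = encrypt(y, 42)
assert decrypt(x, ciphertext) == 42
```

The mapping works because p = 2q + 1 is congruent to 3 modulo 4, so exactly one of m_0 and -m_0 is a quadratic residue.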


Re-encryption. The El-Gamal cryptosystem offers simple
re-encryption, even when using the $\mathbb{Z}_q$ mapping
given above. Given a ciphertext $c = (\alpha, \beta)$, a
ciphertext $c'$ can be computed by selecting
$s \in \mathbb{Z}_q$ and computing
$c' = (g^s \alpha, y^s \beta)$. It is clear that $c'$ and $c$
decrypt to the same plaintext, $c$ with randomness $r$ and
$c'$ with randomness $r + s$.
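
A short Python check of this property follows, as a sketch only. The toy group parameters and function names are assumptions, and the plaintext is taken directly as a subgroup element to keep the example short.

```python
import secrets

Q, P, G = 1019, 2039, 4   # toy group: p = 2q + 1, g generates the q-order subgroup


def encrypt(y, m, r):
    """El-Gamal encryption of a subgroup element m with randomness r."""
    return pow(G, r, P), m * pow(y, r, P) % P


def reencrypt(y, ciphertext, s):
    """Return c' = (g^s * alpha, y^s * beta): same plaintext, randomness r + s."""
    alpha, beta = ciphertext
    return pow(G, s, P) * alpha % P, pow(y, s, P) * beta % P


def decrypt(x, ciphertext):
    """Return alpha^(-x) * beta mod p (a subgroup element)."""
    alpha, beta = ciphertext
    return pow(alpha, Q - x, P) * beta % P


x = secrets.randbelow(Q - 1) + 1
y = pow(G, x, P)
m = pow(G, 7, P)                        # an arbitrary element of the subgroup
r, s = secrets.randbelow(Q), secrets.randbelow(Q)
c = encrypt(y, m, r)
c2 = reencrypt(y, c, s)

assert decrypt(x, c) == decrypt(x, c2) == m
assert c2 == encrypt(y, m, (r + s) % Q)  # identical to encrypting with r + s
```

Note that re-encryption needs only the public key, which is what allows a mix server to re-randomize ciphertexts it cannot decrypt.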


Sako-Kilian Shuffle & Proof. In the Sako-Kilian mixnet, all
inputs are El-Gamal ciphertexts. A mix server takes $N$
inputs, re-encrypts them using re-encryption factors
$\{s_i\}_{i \in [1,N]}$ and permutes them according to a
random permutation $\pi$, so that
$d_i = \mathrm{Reenc}(c_{\pi(i)}, s_i)$.

To prove that it mixed its inputs correctly, a mix server 
produces a second, “shadow mix,” as illustrated in Figure 
1. The verifier then challenges the mix server to reveal 
the permutation and re-encryption factors for either this 
shadow mix or the difference between the two mixes, i.e. 
the shuffle that would transform the shadow mix outputs 
into the primary mix outputs. An honest mix server can 
obviously answer either challenge, while a cheating mix 
server can answer at most one of those questions con- 
vincingly, and is thus caught with at least 50% proba- 
bility. To increase the assurance of integrity, we ask the 
mix server to produce a few shadow mixes. The verifier 
then provides the appropriate number of challenge bits, 
one for each shadow mix. If the mix server succeeds 
at responding to all challenges, then the primary mix is 
correct with probability $1 - 2^{-t}$, where $t$ is the number
of shadow mixes. Choosing t = 80 guarantees integrity 
with overwhelming probability. 
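
One round of the interactive shadow-mix proof can be sketched in Python as follows. The toy group parameters, the index-based representation of the permutation, and all function names are assumptions made for illustration; a full proof would repeat the round t times.

```python
import secrets

Q, P, G = 1019, 2039, 4   # toy group parameters (illustrative only)


def reenc(y, c, s):
    """Re-encrypt ciphertext c = (alpha, beta) with factor s."""
    return pow(G, s, P) * c[0] % P, pow(y, s, P) * c[1] % P


def shuffle(y, inputs):
    """Return (outputs, perm, factors), outputs[i] = Reenc(inputs[perm[i]], factors[i])."""
    n = len(inputs)
    perm = list(range(n))
    secrets.SystemRandom().shuffle(perm)
    factors = [secrets.randbelow(Q) for _ in range(n)]
    return [reenc(y, inputs[perm[i]], factors[i]) for i in range(n)], perm, factors


def respond(perm, factors, shadow_perm, shadow_factors, challenge_bit):
    """Prover's answer: the shadow mix itself (bit 0) or the difference (bit 1)."""
    if challenge_bit == 0:
        return shadow_perm, shadow_factors
    position = {src: j for j, src in enumerate(shadow_perm)}  # input -> shadow slot
    dperm = [position[p] for p in perm]
    dfactors = [(f - shadow_factors[j]) % Q for f, j in zip(factors, dperm)]
    return dperm, dfactors


def check(y, src, dst, perm, factors):
    """Verify that dst[i] = Reenc(src[perm[i]], factors[i]) for every i."""
    return (sorted(perm) == list(range(len(src)))
            and all(reenc(y, src[perm[i]], factors[i]) == dst[i]
                    for i in range(len(dst))))


def verify_round(y, inputs, outputs, shadow, perm, factors, challenge_bit):
    """Bit 0 checks inputs -> shadow; bit 1 checks shadow -> primary outputs."""
    if challenge_bit == 0:
        return check(y, inputs, shadow, perm, factors)
    return check(y, shadow, outputs, perm, factors)
```

In use, the prover first publishes the shadow outputs from shuffle(y, inputs), then receives the challenge bit, and only then calls respond(); revealing both answers for the same shadow mix would disclose the primary permutation.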

Figure 1: The “Shadow-Mix” Shuffle Proof. The mix server
creates a secondary mix. If challenged with bit 0, it reveals this
secondary mix. If challenged with bit 1, it reveals the “difference”
between the two mixes.


In Helios, we need a non-interactive proof: there are
many verifiers and we do not wish to perform such heavy
computation for everyone who requests it. The proof
protocol described above, which is Honest-Verifier
Zero-Knowledge (HVZK), is thus transformed using the
Fiat-Shamir heuristic [9]: the challenge bits are computed as
the hash of all shadow mixes. Note how this approach is
only workable if we have enough shadow mixes to provide
an overwhelming probability of integrity: if there
is a non-negligible probability of cheating, a cheating
prover can produce many shadow mixes until it finds a
set whose hash provides just the right challenge bits to
cheat.
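
As a rough sketch of how challenge bits might be derived from the shadow mixes, consider the following Python function. The choice of SHA-256, the string serialization, and the function name are assumptions; the text above only specifies that the bits are a hash of all shadow mixes.

```python
import hashlib


def challenge_bits(shadow_mixes, t=80):
    """Derive t challenge bits from a hash of all shadow mixes.

    Each shadow mix is a list of (alpha, beta) ciphertext pairs.
    """
    if t > 256:
        raise ValueError("one SHA-256 digest yields at most 256 bits")
    h = hashlib.sha256()
    for mix in shadow_mixes:
        for alpha, beta in mix:
            h.update(f"{alpha},{beta};".encode())
        h.update(b"|")                       # separator between shadow mixes
    digest = int.from_bytes(h.digest(), "big")
    return [(digest >> i) & 1 for i in range(t)]
```

Because the prover cannot predict the digest before fixing all shadow mixes, it must guess all t bits at once in order to cheat, which succeeds with probability 2^-t per attempt.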


Proof of Decryption. Once an El Gamal ciphertext is
decrypted, this decryption can be proven using the
Chaum-Pedersen protocol [8] for proving discrete-logarithm
equality. Specifically, given a ciphertext
$c = (\alpha, \beta)$ and claimed plaintext $m$, the prover
shows that $\log_g(y) = \log_\alpha(\beta/m)$:


• The prover selects $w \in \mathbb{Z}_q$ and sends
$A = g^w$, $B = \alpha^w$ to the verifier.

• The verifier challenges with $c \in \mathbb{Z}_q$.

• The prover responds with $t = w + xc$.

• The verifier checks that $g^t = A y^c$ and
$\alpha^t = B (\beta/m)^c$.


It is clear that, given $c$ and $t$, $A$ and $B$ can be easily
computed, thus providing for simulated transcripts of such
proofs, indicating Honest-Verifier Zero-Knowledge. It is
also clear that, if one could rewind the protocol and obtain
prover responses for two challenge values against the same
$A$ and $B$, the value of $x$ would be easily solvable, thus
indicating that this is a proof of knowledge of the discrete
log and that $\log_g(y) = \log_\alpha(\beta/m)$.

As this protocol is HVZK with overwhelming proba- 
bility of catching a cheating prover, it can be transformed 
safely into non-interactive form using the Fiat-Shamir 
heuristic. We do exactly this in Helios to provide for 
non-interactive proofs of decryption that can be posted 
publicly and re-distributed by observers. 
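
The non-interactive variant can be sketched in Python as below. The toy group parameters, the use of SHA-256 for the Fiat-Shamir challenge, the set of values hashed, and the function names are all assumptions for illustration; the plaintext m is taken as a subgroup element.

```python
import hashlib
import secrets

Q, P, G = 1019, 2039, 4   # toy group parameters (illustrative only)


def fs_challenge(*values):
    """Fiat-Shamir challenge in Z_q: hash the public values of the proof."""
    data = ",".join(str(v) for v in values).encode()
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % Q


def prove_decryption(x, ciphertext, m):
    """Prove log_g(y) = log_alpha(beta / m) without revealing x."""
    alpha, beta = ciphertext
    w = secrets.randbelow(Q)
    A, B = pow(G, w, P), pow(alpha, w, P)
    c = fs_challenge(alpha, beta, m, A, B)
    t = (w + x * c) % Q
    return A, B, t


def verify_decryption(y, ciphertext, m, proof):
    """Check g^t = A * y^c and alpha^t = B * (beta / m)^c."""
    alpha, beta = ciphertext
    A, B, t = proof
    c = fs_challenge(alpha, beta, m, A, B)
    ratio = beta * pow(m, P - 2, P) % P      # beta / m via Fermat inverse
    return (pow(G, t, P) == A * pow(y, c, P) % P
            and pow(alpha, t, P) == B * pow(ratio, c, P) % P)
```

With such a small q the soundness error is large; the sketch only shows the shape of the computation, not a secure instantiation.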


2.4 The Whole Process 


The entire Helios protocol thus unfolds as follows: 


1. Alice prepares and audits as many ballots as she 
wishes, ensuring that all of the audited ballots are 
consistent. When she is satisfied, Alice casts an en- 
crypted ballot, which requires her to authenticate. 


2. The Helios bulletin board posts Alice’s name and 
encrypted ballot. Anyone, including Alice, can 
check the bulletin board and find her encrypted vote 
posted. 


3. When the election closes, Helios shuffles all en- 
crypted ballots and produces a non-interactive proof 
of correct shuffling, correct with overwhelming 
probability. 


4. After a reasonable complaint period to let auditors 
check the shuffling, Helios decrypts all shuffled 
ballots, provides a decryption proof for each, and 
performs a tally. 


5. An auditor can download the entire election data 
and verify the shuffle, decryptions, and tally. 


If an election is made up of more than one race, then 
each race is treated as a separate election: each with its 
own bulletin board, its own independent shuffle and shuf- 
fle proof, and its own decryptions. This serves to limit 
the possibility of re-identifying voters given long ballots 
where any given set of answers may be unique in the set 
of cast ballots. 


3 Web Components 


We have clearly stated that Helios values integrity first, 
and voter privacy second. That said, Helios still takes 
great care to ensure voter privacy, using a combination 
of modern Web programming techniques. Once the bal- 
lot is loaded into the browser, all candidate selections are 
recorded within the browser’s memory, without any fur- 
ther network calls until the ballot is encrypted and the 
plaintext is discarded. In this section, we cover the Web 
components we use to accomplish this goal. 


3.1 Single-Page Web Application


A number of Web applications today are called “single- 
page applications” in that the page context and its URL 
never change. Gmail [10] is a prime example: clicks 
cause background actions rather than full-page loads. 
The technique behind this type of Web application is the 
use of JavaScript to handle user clicks: 





<a onclick="do_stuff()" href="#">Do Stuff</a> 











When a user clicks the “Do Stuff” link, no new page 
is loaded. Instead, the JavaScript function do_stuff()
is invoked. This function may make network requests 
and update the page’s HTML, but, importantly, the page 
context, including its JavaScript scope, is preserved. 

For our purposes, the key point is that, if all necessary 
data is pre-loaded, the do_stuff() function may not 
need to make any network calls. It can update some of 
its scope, read some of its pre-loaded data, and update 
the rendered HTML user interface accordingly. This is 
precisely the approach we use for our ballot preparation 
system: the browser loads all election parameters, then 
leads the voter through the ballot without making any 
additional network requests. 


The jQuery JavaScript Library. Because we ex- 
pect auditors to take a close look at our browser-based 
JavaScript code, it is of crucial importance to make this 
code as concise and legible as possible. For this purpose, 
we use the jQuery JavaScript library, which provides 
flexible constructs for accessing and updating portions 
of the HTML Document Object Model (DOM) tree, ma- 
nipulating JavaScript data structures, and making asyn- 
chronous network requests (i.e. AJAX). An auditor is 
then free to compare the hash of the jQuery library we 
distribute with that of the official distribution from the 
jQuery web site. 


JavaScript-based Templating. Also important to the 
clarity of our browser-based code is the level of inter- 
mixing of logic and presentation: when all logic is im- 
plemented in JavaScript, it is tempting to intermix small 
bits of HTML, which makes for code that is particularly 
difficult to follow. Instead, we use the jQuery JavaScript 
Templating library. Then, we can bind a template to a 
portion of the page as follows: 





$("#main").setTemplateURL(
    "/templates/election.html"
);


which connects to an HTML template with variable
placeholders: 





<p> 
The election hash is {$T.election.hash}. 
</p> 











The code can, at a later point, render this template with a 
parameter and without any additional network access: 





$("#main").processTemplate(
    {'election': election_object}
);











3.2 Cryptography in the Browser with 
LiveConnect 


JavaScript is a complete programming language in which 
it is possible to build a multi-precision integer library. 
Unfortunately, JavaScript performance for such compu- 
tationally intensive operations is poor. Thankfully, it 
is possible in modern browsers to access the browser’s 
Java Virtual Machine from JavaScript using a technology 
called LiveConnect. This is particularly straightforward 
in Firefox, where one can write the following JavaScript 
code: 





var a = new java.math.BigInteger(42);
document.write(a.toString());











and then, from JavaScript still, invoke all of Java’s
BigInteger methods directly on the object. Modular
exponentiation is a single call, modPow(), and El-Gamal
encryption runs fast enough that it is close to
imperceptible to the average user. LiveConnect is slightly
more complicated to implement in Internet Explorer and
Safari, though it can be done [18].


3.3 Additional Tricks


Data URIs. At times in the Helios protocol, we need 
to produce a printable receipt when the plaintext vote has 
not yet been cleared from memory. In order to open a 
new window ready for printing without network access, 
we use data URIs [14], URIs that contain information 
without requiring a network fetch: 





<a target="_new"
   href="data:text/plain,Your%20Receipt...">
receipt
</a>


Dynamic Windows. When data URIs are not available
(e.g. Internet Explorer), we can open a new window us- 
ing JavaScript, set its MIME type to text/plain, and 
dynamically write its content from the calling frame. 





var receipt = window.open();
receipt.document.open("text/plain");
receipt.document.write(content);
receipt.document.close();











In Safari and Firefox, this approach yields a new window 
in a slightly broken state: the contents cannot be saved to 
disk. However, in Internet Explorer, the only browser 
that does not support Data URIs, the dynamic window 
creation works as expected. Thus, in Firefox and Safari, 
Helios uses Data URIs, and in Internet Explorer it uses 
dynamic windows. 


JSON. As we expect that auditors will want to down- 
load election, voter, and bulletin board data for pro- 
cessing and verifying, we need a data format that is 
easy to parse in most programming languages, includ- 
ing JavaScript. XML is one possibility, but we found 
that JavaScript Object Notation (JSON) is easier to han- 
dle with far less parsing code. JSON allows for data rep- 
resentation using JavaScript lists and associative arrays. 
For example, a list of voters and their encrypted votes 
can be represented as: 





[
  {'name' : 'Alice', 'vote' : '23423....'},
  {'name' : 'Bob', 'vote' : '823848....'}
]










Libraries exist in all major programming languages for 
parsing and generating this data format. In particular, the 
format maps directly to arrays and objects in JavaScript, 
lists and dictionaries in Python, and lists and hashes in Ruby.
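
For instance, an auditor working in Python could load such a voter list with the standard json module. The document below is a hypothetical stand-in with placeholder vote values; note that strict JSON parsers require double-quoted strings.

```python
import json

# Hypothetical bulletin-board document; the vote values are placeholders.
board_json = '[{"name": "Alice", "vote": "23423"}, {"name": "Bob", "vote": "823848"}]'

voters = json.loads(board_json)                    # a list of dictionaries
votes_by_name = {v["name"]: v["vote"] for v in voters}
round_trip = json.dumps(voters, sort_keys=True)    # back to a JSON string

assert votes_by_name["Alice"] == "23423"
assert json.loads(round_trip) == voters
```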


4 Helios System Description 


We are now ready to discuss the details of the Helios sys- 
tem. We begin with a description of the back-end server 
architecture. We then consider the four use cases: creat- 
ing an election, voting, tallying, and auditing. 


4.1 Server Architecture 


The Helios back-end is a Web application written in the 
Python programming language [17], running inside the 
CherryPy 3.0 application server, with a Lighttpd web 
server. All data is stored in a PostgreSQL database. 
All server-side logic is implemented in Python, with
HTML templates rendered using the Cheetah Templating
engine. Many back-end API calls return JSON data
structures using the Simplejson library, and the voting
booth server-side template is, in fact, a single-page web
application including JavaScript logic and jTemplate
HTML/JavaScript templates.


Application Software. We use the Python Cryptogra- 
phy Toolkit for number theory utilities such as prime 
number and random number generation. We imple- 
mented our own version of El-Gamal in Python, given 
our specific need for re-encryption, which is typically 
not supported in cryptographic libraries. We note that 
improved performance could likely be gained from opti- 
mizing our first-pass implementation. 


Server Hardware. We host an alpha version of the 
Helios software at http://heliosvoting.org. 
The server behind that URL is a virtual Ubuntu Linux 
server operated by SliceHost. For the tests performed 
in Section 5.3, we used a small virtual host with 256 
megabytes of RAM and only a fraction of a Xeon proces- 
sor, at a cost of $20/month. A larger virtual host would 
surely provide better performance, but we wish to show 
the practicality of Helios even with modest resources. 


4.2 Creating an Election 


Only registered Helios users can create elections. Reg- 
istration is handled like most typical web sites: 


• a user enters an email address, a name, and a desired
password.

• an email with an embedded confirmation link is sent
to the given email address.

• the user clicks on the confirmation link to activate
his account.


A registered user then creates an election with an elec- 
tion name, a date and time when voting is expected to 
begin, and a date and time when voting is expected to 
end. Upon creation, Helios generates and stores a new 
El-Gamal keypair for the election. Only the public key is 
available to the registered user: Helios keeps the private 
key secret. The user who created the election is consid- 
ered the administrator. 


Setting up the Ballot. The election is then in “build 
mode,” where the ballot can be prepared, reviewed, and 
tweaked by the administrative user, as shown in Figure 
2. The user can log back in over multiple days to adjust 
any aspect of the ballot. 


Figure 2: The Helios Election Builder lets an administrative user create and edit ballot questions in a simple web-based interface.
The administrative user can log out and back in at any time to update the election.


Managing Voters. The administrative user can add, 
update, and remove voters at will, as shown in Figure 3. 
A voter is identified by a name and an email address, and 
is specific to a given election. Helios generates a ran- 
dom 10-character password automatically for each voter. 
At any time, the administrator can email voters using the 
Helios administrative interface. These emails will au- 
tomatically contain the voter’s password, though the ad- 
ministrator will not see this password at any time. 
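
A password of this kind could be generated in Python roughly as follows. The character set and function name are assumptions; the text only states that the password is random and 10 characters long.

```python
import secrets
import string

ALPHABET = string.ascii_letters + string.digits   # assumed character set


def generate_password(length=10):
    """Return a random password drawn uniformly from ALPHABET."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

The secrets module draws from the operating system's cryptographic random source, which is preferable to the random module for credentials.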


Freezing the Election. When ready, the administrative 
user freezes the election, at which point the voter list, the 
election start and end dates, and the ballot details become 
immutable and available for download in JSON form. 
The administrative user receives an email from Helios 
with the SHA1 hash of this JSON object. The election is 
ready for voters to cast ballots. The administrative user 
will typically email voters using the Helios administra- 
tive interface to let them know that the polls are open. 
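
A fingerprint of the frozen election could be computed along these lines. The canonical serialization (sorted keys, compact separators) and the unpadded base64 encoding are assumptions made for this sketch; the text only specifies a SHA1 hash of the JSON object.

```python
import base64
import hashlib
import json


def election_fingerprint(election):
    """SHA-1 over a canonical JSON serialization, base64-encoded without padding."""
    canonical = json.dumps(election, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha1(canonical.encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")
```

Fixing the serialization matters: two parties hashing the same election with different key orders or whitespace would otherwise obtain different fingerprints.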


4.3 Voting 


Alice, a voter in a Helios election, receives an email let- 
ting her know that the polls are open. This email con- 
tains her username (i.e. her email address), her election- 
specific password, the SHA1 hash of the election param- 
eters, and the URL that directs her to the Helios voting 
booth, as illustrated in Figure 4. It is important to note 
that this URL does not contain any identifying informa- 
tion: it only identifies the election, as per the vote-casting 
protocol in Section 2.1. 


The Voting Booth. When Alice follows the voting
booth URL, Helios responds with a single-page web 
application. This application, now running in Alice’s 
browser, displays a “loading...” message while it down- 
loads the election parameters and templates, including 
the El-Gamal public key and questions. The page then 
displays the election hash prominently, and indicates that 
no further network connections will be made until Alice 
submits her encrypted ballot. (Alice can set her browser 
to “offline” mode to enforce this.) Every transition is 
then handled by a local JavaScript function call and its 
associated templates. Importantly, the JavaScript code 
can decide precisely what state to maintain and what 
state to discard: the “back” button is not relevant. This is 
illustrated in Figure 5. 


Filling in the Ballot. Alice can then fill in the bal- 
lot, selecting the checkbox by each desired candidate 
name, using the “next” and “previous” buttons to nav- 
igate between questions. Each click is handled by 
JavaScript code which records Alice’s choices in the lo- 
cal JavaScript scope. If Alice tries to close her browser 
or navigate to a different URL, she receives a warning 
that her ballot will be cleared. 


Sealing. After Alice has reviewed her options, she can 
choose to “seal” her ballot, which triggers the JavaScript 
code to encrypt her selection with computationally inten- 
sive operations performed via LiveConnect. The SHA1 
hash of the resulting ciphertext is then displayed, as 
shown in Figure 6. 


Figure 3: The Helios voter management interface.


Figure 4: The administrative user can send emails to all voters. Each voter receives her password, which the administrative user
does not see.


Figure 5: The Helios Voting Booth.


Figure 6: Sealing a Helios ballot.


Auditing. Alice can opt to audit her ballot with the
“Audit” button, in which case the JavaScript code reveals 
the randomness used in encrypting Alice’s choices. Alice 
can save this data to disk and run her own code to ensure 
the encryption was correct, or she can use the Python 
Ballot Encryption Verification (BEV) program provided 
by Helios. 

Once Alice chooses to audit her ballot and the audit- 
ing information is rendered, the JavaScript code clears 
its encrypted ballot data structures and returns Alice to 
the confirmation screen, where she can either update her 
choices or choose to seal her options again with different 
randomness and thus a different ciphertext. 


Casting. If Alice chooses instead to cast her ballot, the 
JavaScript code clears the plaintext and randomness from 
its scope, and presents Alice with a login prompt for 
her email address and password. (If Alice had set her 
browser to “offline” mode, she should bring it back on- 
line now that all plaintext information is cleared.) When 
Alice submits her login information, the JavaScript code 
intercepts the form submission and submits the email, 
password, and encrypted vote in a background call, so 
that any errors, e.g. a mistyped password, can be re- 
ported without clearing the JavaScript scope and thus the 
encrypted ballot. When a success code is returned by 
the Helios server, the JavaScript code can clear its entire 
scope and display a success message. On the server side, 
Helios emails Alice with a confirmation of her encrypted 
vote, including its SHA1 hash. 


Coerce Me! As explained in Section 2, Helios pro- 
vides a “Coerce Me!” button to make it clear that online 
voting is inherently coercible. This button appears af- 
ter ballot sealing, next to the “audit” and “cast” options. 
When clicked, Helios opens up a new window with a 
mailto: URL that triggers Alice’s email client to open 
a composition window containing the entire ballot infor- 
mation, including plaintext and randomness that prove 
how the ciphertext was formed. Unlike the “Audit” step, 
which forces Alice to create a new ciphertext, “Coerce 
Me!” allows Alice to continue and cast that very same 
encrypted vote for which she obtained proof of encryp- 
tion. The distinction between these two steps highlights 
the difference between a coercion-free auditing process 
that could potentially be used with in-person voting, and 
the inherent coercibility of online-only voting which is 
made more explicit with the “Coerce Me!” button. 
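
The construction of such a mailto: URL can be sketched in Python, although Helios performs this step in browser-side JavaScript. The recipient, subject line, body layout, and function name below are hypothetical.

```python
from urllib.parse import quote


def coerce_me_url(recipient, plaintext, randomness, ciphertext):
    """Build a mailto: URL whose body carries the full proof of the vote."""
    body = (f"plaintext: {plaintext}\n"
            f"randomness: {randomness}\n"
            f"ciphertext: {ciphertext}\n")
    return "mailto:{}?subject={}&body={}".format(
        quote(recipient, safe="@"),          # keep '@' literal in the address
        quote("Proof of my vote"),
        quote(body))
```

Percent-encoding the body keeps spaces and newlines intact when the email client opens the composition window.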


4.4 Anonymization 


Once the voting period ends, Helios enables the 
anonymization, decryption, and proof features for the 
administrative user. Selecting “shuffle” will begin the 
re-encryption and permutation process. Then, selecting
“shuffle proof” will trigger the mixnet proof with 80
shadow mixes. The administrative user can then opt for 
“decrypt”, which will decrypt the shuffled ciphertexts, 
and “decrypt proof”, which will generate proofs for each 
such decryption. Finally, the administrative user can se- 
lect “tally” to count up the decrypted votes. 

All of these operations are performed on the server 
side, in Python code. The results are stored in the 
database and made available for download in JSON form. 
Once all proofs are generated and the result is tallied, 
the server deletes the permutation, randomness, and se- 
cret key for that election. All that is left is the encrypted 
votes, their shuffling, the resulting decryptions, and the 
publicly verifiable proofs of integrity. The entire elec- 
tion can still be verified, though no further proofs can be 
generated. 


4.5 Auditing 


Helios provides two verification programs, one for ver- 
ifying a single encrypted vote produced by the ballot 
preparation system with the “audit” option selected, and 
another for verifying the shuffling, decryption, and tally- 
ing of an entire election. Both programs are written in 
Python using the Simplejson library for JSON process- 
ing, but otherwise only raw Python operations. 


Verifying a Single Vote. The Ballot Encryption Veri- 
fication program takes as input the JSON data structure 
returned by the voting booth audit process. This data 
structure contains a plaintext ballot, its ciphertext, the 
randomness used to encrypt it, and the election ID. The 
program downloads the election parameters based on the 
election ID and outputs: 


• the hash of the election, which the voter can check
against that displayed by the voting booth,


• the hash of the ciphertext, which the voter can check
against the receipt she obtained before requesting an
audit,


• the verified plaintext of the ballot.
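
The core check of such a verification program might look as follows. This is a sketch under several assumptions: the JSON field names, the toy group parameters, and the hashing of a comma-joined ciphertext are hypothetical, and the election parameters are passed in as a string rather than downloaded.

```python
import hashlib
import json

Q, P, G = 1019, 2039, 4   # toy group; a real verifier would read these from the election


def verify_audited_ballot(audit_json, election_json):
    """Re-encrypt the revealed plaintext and compare with the claimed ciphertext."""
    audit = json.loads(audit_json)
    y = json.loads(election_json)["public_key"]
    alpha, beta = audit["ciphertext"]
    r, m = audit["randomness"], audit["plaintext"]
    if (alpha, beta) != (pow(G, r, P), m * pow(y, r, P) % P):
        raise ValueError("ciphertext does not match plaintext and randomness")
    return {
        "election_hash": hashlib.sha1(election_json.encode()).hexdigest(),
        "ciphertext_hash": hashlib.sha1(f"{alpha},{beta}".encode()).hexdigest(),
        "plaintext": m,
    }
```

The voter then compares the two returned hashes with the values shown by the voting booth before the audit was requested.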


Verifying an Election. The Election Tallying Verifica- 
tion program takes, as input, an election ID. It down- 
loads the election parameters, the bulletin board of cast 
votes, shuffled votes, shuffle proofs, decrypted votes, and 
decryption proofs. The verification program checks all 
proofs, then re-performs the tally based on the decryp- 
tions. It eventually outputs the list of voters and their re- 
spective encrypted ballot hashes, plus the verified tally. 
This information can be reposted by the auditor, so that 
if enough auditors check and re-publish the cast ballot
hashes and tally, participants can be confident that their 
vote was correctly captured, and that the tally was cor- 
rectly performed. 
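
The final re-tally step is simple enough to sketch in a few lines of Python. The function name and data shapes are assumptions, and the checking of shuffle and decryption proofs, which must precede this step, is omitted here.

```python
from collections import Counter


def verify_tally(decrypted_votes, claimed_tally):
    """Recount the decrypted votes and compare with the published tally."""
    recount = Counter(decrypted_votes)
    candidates = set(recount) | set(claimed_tally)
    matches = all(recount[c] == claimed_tally.get(c, 0) for c in candidates)
    return matches, recount
```

Candidates with zero votes are handled explicitly, since a Counter omits them while a published tally may list them.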


5 Discussion 


Helios is simpler than most cryptographic voting proto- 
cols because it focuses on proving integrity. As a com- 
promise, Helios makes weaker guarantees of privacy. In 
this section, we review in greater detail the type of elec- 
tion for which we expect this compromise to be appropri- 
ate, as well as the security model, performance metrics, 
and future extensions we can make to improve Helios on 
both fronts. 


5.1 The Need for Verifying Elections with 
Low Coercion Risk 


It is legitimate to question whether there truly exist elec- 
tions that require the high levels of verifiability afforded 
by cryptography, while eschewing coercion-resistance 
altogether. In fact, we believe that, for a number of on- 
line communities that rarely or never meet in the same 
physical place: 


1. coercion-resistance is futile from the start, given the 
remote nature of the voting process, and 

2. cryptographic end-to-end verifiability is the only vi- 
able means of ensuring any level of integrity. 


Specifically, with respect to the auditing argument, 
how could a community member remotely verify any- 
thing at all pertaining to the integrity of an election pro- 
cess? Open-source software is insufficient: the voter 
doesn’t know which software is actually running on the 
election server, short of deploying hardware-rooted attes- 
tation. Physical observation of a chain-of-custody pro- 
cess is already ruled out by the online-only nature of the 
community. Cryptographic verifiability, though it seems 
stronger than absolutely necessary, is the only viable 
option when only the public inputs and outputs—never 
the “guts”—of the voting process can be truly observed. 
Cryptographic auditing may be a big hammer, but it is 
the only hammer. 

For the same reason, we believe the pedagogical value 
of a system like Helios is particularly strong. The con- 
trast between classic and open-audit elections is partic- 
ularly apparent in this online setting. With Helios, the 
voter’s ability is transformed, from entirely powerless 
and forced to trust a central system, to empowered with 
the ability to ensure that one’s vote was correctly cap- 
tured and tallied, without trusting anyone. 


5.2 Security Model & Threats


We accept the risk that, if someone compromises the He- 
lios server before the end of an election, the secrecy of 
individual ballots may be compromised. On the other 
hand, we claim that, assuming enough auditors, even a 
fully corrupted Helios cannot cheat the election result 
without a high chance of getting caught. We now explore 
various attacks and how we expect them to be handled. 


Incorrect Shuffling or Decryption. A corrupt Helios 
server may attempt to shuffle votes incorrectly or de- 
crypt shuffled votes incorrectly. Given the overwhelming 
probability of catching these types of attacks via crypto- 
graphic verification, it takes only one auditor to detect 
this kind of tampering. 


Changing a Ballot or Impersonating a Voter. A cor- 
rupt Helios may substitute a new ciphertext for a voter, 
replacing his cast vote or injecting a vote when a voter 
doesn’t cast one in the first place. Even if the ballot sub- 
mission server is eventually hosted separately and dis- 
tributed among trustees, a corrupt Helios server knows 
the username and password for all users, and can thus 
easily authenticate and cast a ballot on behalf of a user. 
In this case, all of the shuffling and decryption verifica- 
tions will succeed, because the corruption occurs before 
the encryption step. 

In the current implementation of Helios, we hope to 
counter these attacks through extensive auditing. Previ- 
ous analyses [7] have shown that it takes only a small 
random sample of voters who verify their vote to defeat 
this kind of attack. To encourage voters to audit their 
votes, we created the Election Tallying Verification program,
available in well-commented source form. The
Election Tallying Verification program outputs a copy of 
all cast ballots, so that auditors can post this information 
independently. We expect multiple auditors to follow this 
route and re-publish the complete list of encrypted bal- 
lots along with their re-computed election outcome. This 
auditing may include re-contacting individual voters and 
asking them to verify the hash of their cast encrypted bal- 
lot. We also expect that a large majority of voters, maybe 
all voters, in fact, will answer at least one auditor who 
prompts them to verify their cast encrypted vote. 
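
To see why a modest sample of verifying voters is enough, consider a simple model that is our own assumption rather than the analysis of [7]: the server alters k of n ballots, and v voters chosen uniformly at random check their ballot hash. The attack goes unnoticed only if every verifier happens to hold an unaltered ballot.

```python
from math import comb


def detection_probability(n, k, v):
    """Chance that at least one of k altered ballots (out of n) is among
    the v ballots checked by uniformly random verifying voters."""
    if v > n - k:
        # More verifiers than untouched ballots: some altered ballot is checked.
        return 1.0
    # Probability that all v verifiers fall among the n - k untouched ballots.
    undetected = comb(n - k, v) / comb(n, v)
    return 1 - undetected
```

Under this model the detection probability climbs quickly with either k or v, which is why an attacker who needs to change many ballots to swing an outcome is very likely to be caught.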


Corrupting the Ballot. A corrupt Helios may present 
a corrupt ballot to Alice, making her believe that she’s se- 
lecting one candidate when actually she is voting for an- 
other. This kind of attack would defeat the hashed-vote 
bulletin-board verification, even with multiple auditors, 
since Alice receives an entirely incorrect receipt during 
the ballot casting process. Helios mitigates this risk by 
authenticating users only after the ballot has been filled
out, so users cannot be individually targeted with corrupt
ballots as easily. However, a corrupt Helios may authen- 
ticate voters first (voters may not notice), or use other 
information (e.g. IP address) to identify voters and target 
certain victims for ballot corruption. 

To counter this attack, we provide the Ballot Encryp- 
tion Verification program, again in source form for au- 
ditors to verify. This program can be run by individual 
voters when they choose to audit a handful of votes be- 
fore they choose to truly cast one. Alternatively, auditors, 
even auditors who are not eligible to vote in the election, 
can prepare ballots and audit them at will. 


Auditing is Crucial. It should be clear from these de- 
scriptions that Helios counters attacks through the power 
of auditing. In addition to the raw tally, Helios publishes 
a list of voter names and corresponding encrypted votes. 
Helios then provides supporting evidence for the tally, 
given the cast encrypted votes, in the form of a mixnet- 
and-decryption proof. Verification programs are avail- 
able in source form for anyone to review the integrity of 
the results. 

However, only the individual voters can check the va- 
lidity of the cast encrypted ballots. It is expected that 
multiple auditors will check the proof and, when sat- 
isfied, republish the tally and the list of cast encrypted 
ballots, where voters can check that their ballot was cor- 
rectly recorded. Helios ensures that, if a large majority 
of voters verifies their vote, then the outcome is correct. 
However, if voters do not verify their cast ballot, Helios 
does not provide any verification beyond classic voting 
systems. 

These expectations are somewhat tautological: voter- 
verified elections function only when at least some frac- 
tion of the voters are willing to participate in the verifi- 
cation process made available to them. Elections can be 
made verifiable, but only voters can actually verify that 
their secret ballot was correctly recorded. 


5.3 Performance


For all performance measurements, we used the server 
hardware described in the previous section, and, on the 
client side, a 2.2 GHz Macintosh laptop running Firefox 2
over a home broadband connection. We note that perfor- 
mance of Firefox 2 was greatly increased when running 
on virtualized Linux on the same laptop, indicating that 
our measurements are likely a worst-case scenario given 
platform-specific performance peculiarities of Firefox. 


Java Virtual Machine Startup. The Java Virtual Ma- 
chine requires startup time. Our rough measurements in- 
dicate anywhere between 500ms and 1.5s on our client 
machine. During this time, the browser appears to freeze 
and user input is suspended. To an uninformed user, this
is a usability impediment which will require further user 
testing. That said, it is a behavior we can easily warn 
users about before starting up the Ballot Preparation Sys- 
tem, and because this happens only once per user session 
— not once per ballot — it is not too onerous. 


Timing Measurements. We experimented with a 2- 
question election and 500 voters. All timings were per- 
formed a sufficient number of times to obtain a stable 
average mostly free of testing noise. Note that time mea- 
surements that pertain to a set of ballots are expected to 
scale linearly with the number of ballots and the number of
questions in the election. Our results are presented in 
Figure 7.


Operation (|p| = 1024 bits)          Time

Ballot Encryption, in browser        [illegible in source]
Shuffling, on server                 133s
Shuffle Proof, on server             3 hours
Decryption, on server                71s
Decryption Proof, on server          210s
Complete Audit, on client            4 hours


Figure 7: Timing Measurements


The Big Picture. It takes only a few minutes of com- 
putation to obtain results for a 500-voter election. The 
shuffle proof and verification steps require a few hours, 
and are thus, by far, the most computation-intensive por- 
tions of the process. We note that both of these steps 
are highly parallelizable and thus could be significantly 
accelerated with additional hardware. 
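
As a rough planning aid, the linear-scaling expectation stated above can be turned into a back-of-the-envelope estimator. The baseline figures are those of Figure 7 (500 voters, 2 questions); the `workers` parameter models an ideal parallel speedup, which is our own optimistic assumption rather than a measured result.

```python
BASELINE_VOTERS = 500
BASELINE_QUESTIONS = 2

# Server and client timings from Figure 7, in seconds.
BASELINE_SECONDS = {
    "shuffling": 133,
    "shuffle_proof": 3 * 3600,
    "decryption": 71,
    "decryption_proof": 210,
    "complete_audit": 4 * 3600,
}


def estimate_seconds(operation, voters, questions, workers=1):
    """Scale a Figure 7 timing linearly in voters and questions,
    then divide by an idealized number of parallel workers."""
    scale = (voters / BASELINE_VOTERS) * (questions / BASELINE_QUESTIONS)
    return BASELINE_SECONDS[operation] * scale / workers
```

For example, `estimate_seconds("shuffle_proof", 2000, 2, workers=4)` returns the same three hours as the baseline, since four times the ballots are spread over four workers.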


5.4 Extensions 


There are many future directions for Helios. 


Support for Other Types of Election. Helios cur- 
rently supports only simple elections where Alice selects 
1 or more out of the proposed candidates. Adding write-ins
and rank-based voting, as well as the associated tallying
mechanisms, could prove useful. Helios may also
eventually offer homomorphic-based tabulation, which is
often easier to explain and verify, though it would
make greater demands of browser-based cryptography.


Browser-Based Verification. The current verification 
process for the ballot encryption step is a bit tedious, re- 
quiring the use of a browser and a Python program. We 
could write a JavaScript-only verification program that 
could be provided directly by auditors while running entirely
in the voter’s browser to check that Helios is delivering
authentic ballots. There are some issues to deal
with, notably cross-domain requests, but it does seem 
possible and desirable to accomplish browser-only bal- 
lot encryption verification. 

Similarly, it is certainly possible to audit an entire elec- 
tion using JavaScript and LiveConnect for computation- 
ally intensive operations. Letting auditors deliver the 
source code for these verification programs would allow 
any voter to audit the entire process straight from their 
browser. 


Distributing the Shuffling and Decryption. For im- 
proved privacy guarantees, Helios can be extended to 
support shuffling and decryption by multiple trustees. 
The Helios server would then only focus on provid- 
ing the bulletin board and voting booth functionality. 
Trustees would be provided with standalone Python pro- 
grams that perform threshold key generation, partial 
shuffling and threshold decryption. They could individ- 
ually audit the program’s source code. With these exten- 
sions, Helios would resemble classic cryptographic vot- 
ing protocols more closely, and would provide stronger 
privacy guarantees. 


Improving Authentication. Currently, our protocol 
requires that most voters audit their cast ballot, otherwise 
the Helios server could impersonate voters and change 
the election outcome. Future versions of Helios should
consider offloading authentication to a separate authenti- 
cation service. If feasible with browser-based cryptogra- 
phy, Helios should use digital signatures to authenticate 
each ballot in a publicly verifiable manner. 


6 Related Work 


There is a plethora of theoretical cryptographic voting 
work reviewed and cited in [11, 4]. We do not attempt to 
re-document this significant body of work here. 


Open-audit voting implementations. There are only 
a small handful of notable open-audit voting implemen- 
tations. VoteHere’s advanced protocols for mixnets and 
coercion-free ballot casting [3] have been implemented 
and deployed in test environments. The Punchscan vot- 
ing system [2] has also been implemented and used in a 
handful of real student government elections, with video 
evidence available for all to see. 


Browser-based cryptography. Cryptographic con- 
structs have been implemented in browser-side code in 
many different settings. In the late 1990s, Hushmail began
providing web-based encrypted email using a Java
applet. A couple of years later, George Danezis showed 
how to use LiveConnect for fast JavaScript-based cryp- 
tography, and the EVOX voting project [12] used similar 
technology to encrypt votes in a blind-signature-based 
scheme. The Stanford SRP project [18] also uses Live- 
Connect for browser-based cryptography and indicates 
how one can get LiveConnect to work in browsers other 
than Firefox. The recent Clipperz Crypto Library [1] 
provides web-based cryptography in pure JavaScript, in- 
cluding a multi-precision integer library. 


7 Conclusion 


Helios is the first publicly available implementation of 
a web-based open-audit voting system. It fills an inter- 
esting niche: elections for small clubs, online communi- 
ties, and student governments need trustworthy elections 
without the significant overhead of coercion-freeness. 
We hope that Helios can be a useful educational resource 
for open-audit voting by providing a valuable service — 
outsourced, verifiable online elections — that could not be 
achieved without the paradigm-shifting contributions of 
cryptographic verifiability. 


Abstract


Commercial electronic voting systems have experienced 
many high-profile software, hardware, and usability fail- 
ures in real elections. While it is tempting to abandon 
electronic voting altogether, we show how a careful ap- 
plication of distributed systems and cryptographic tech- 
niques can yield voting systems that surpass current sys- 
tems and their analog forebears in trustworthiness and us- 
ability. We have developed the VoteBox, a complete elec- 
tronic voting system that combines several recent e-voting 
research results into a coherent whole that can provide 
strong end-to-end security guarantees to voters. VoteBox 
machines are locally networked and all critical election 
events are broadcast and recorded by every machine on 
the network. VoteBox network data, including encrypted 
votes, can be safely relayed to the outside world in real 
time, allowing independent observers with personal com- 
puters to validate the system as it is running. We also 
allow any voter to challenge a VoteBox, while the election 
is ongoing, to produce proof that ballots are cast as in- 
tended. The VoteBox design offers a number of pragmatic 
benefits that can help reduce the frequency and impact of 
poll worker or voter errors. 


1 Introduction 


Electronic voting is at a crossroads. Having been aggres- 
sively deployed across the United States as a response 
to flawed paper and punch-card voting in the 2000 U.S. 
national election, digital-recording electronic (DRE) vot- 
ing systems are themselves now seen as flawed and un- 
reliable. They have been observed in practice to pro- 
duce anomalies that may never be adequately explained— 
undervotes, ambiguous audit logs, choices “flipping” be- 
fore the voter’s eyes. Recent independent security reviews 
commissioned by the states of California and Ohio have 
revealed that every DRE voting system in widespread use 
has severe deficiencies in design and implementation, ex- 
posing them to a wide variety of vulnerabilities; these sys- 
tems were never engineered to be secure. As a result, 
many states are now decertifying or restricting the use of
DRE systems. 

Consequently, DREs are steadily being replaced with 
systems employing optical-scan paper ballots. Op-scan 
systems still have a variety of problems, ranging from ac- 
cessibility issues to security flaws in the tabulation sys- 
tems, but at least the paper ballots remain as evidence 
of the voter’s original intent. This allows voters some 
confidence that their votes can be counted (or at least re- 
counted) properly. However, as with DRE systems, if er- 
rors or tampering occur anywhere in this process, there is 
no way for voters to independently verify that their ballots 
were properly tabulated. 

Regardless, voters subjectively prefer DRE voting sys- 
tems [15]. DREs give continuous feedback, support many 
assistive devices, permit arbitrary ballot designs, and so 
on. Furthermore, unlike vote-by-mail or Internet voting, 
DREs, used in traditional voting precincts, provide privacy,
protecting voters from bribery or coercion. We would ide- 
ally like to offer voters a DRE-style voting system with ad- 
ditional security properties, including: 


1. Minimized software stack 

2. Resistance to data loss in case of failure or tampering 

3. Tamper-evidence: a record of election day events 
that can be believably audited 

4. End-to-end verifiability: votes are cast as intended 
and counted as cast 


The subject of this paper is the VoteBox, a complete
electronic voting system that offers these essential prop- 
erties as well as a number of other advantages over exist- 
ing designs. Its user interface is built from pre-rendered 
graphics, reducing runtime code size as well as allow- 
ing the voter’s exact voting experience to be examined 
well before the election. VoteBoxes are networked in a
precinct and their secure logs are intertwined and repli- 
cated, providing robustness and auditability in case of fail- 
ure, misconfiguration, or tampering. While all of these 
techniques have been introduced before, the novelty of 
this work lies in our integration of these parts to achieve 
our architectural security goals. 

Notably, we use a technique adapted from Benaloh’s
work on voter-initiated auditing [4] to gain end-to-end 
verifiability. Our scheme, which we term immediate bal- 
lot challenge, allows auditors to compel any active voting 
machine to produce proof that it has correctly captured 
the voter’s intent. With immediate challenges, every sin- 
gle ballot may potentially serve as an election-day test of 
a VoteBox’s correctness. We believe that the VoteBox architecture
is robust to the kinds of failures that commonly
occur in elections and is sufficiently auditable to be trusted 
with the vote. 

In the next section we will present background on the 
electronic voting problem and the techniques brought to 
bear on it in our work. We expand on our design goals 
and describe our VoteBox architecture in Section 3, and
share details of our implementation in Section 4. The pa- 
per concludes with Section 5. 


2 Background 


2.1 Difficulties with electronic voting 


While there have been numerous reports of irregularities 
with DRE voting systems in the years since their introduc- 
tion, the most prominent and indisputable problem con- 
cerned the ES&S iVotronic DRE systems used by Sarasota 
County, Florida, in the November 2006 general election. 
In the race for an open seat in the U.S. Congress, the mar- 
gin of victory was only 369 votes, yet over 18,000 votes 
were officially recorded as “undervotes” (i.e., cast with no 
selection in this particular race). In other words, 14.9% 
of the votes cast on Sarasota’s DREs for Congress were 
recorded as being blank, which contrasts with undervote 
rates of 1-4% in other important national and statewide 
races. While a variety of analyses were conducted of the 
machines and their source code [18, 19, 51], the official 
loser of the election continued to challenge the results 
until a Congressional investigation failed to identify the 
source of the problem [3]. Whether the ultimate cause 
was mechanical failure of the voting systems or poor hu- 
man factors of the ballot design, there is no question that 
these machines failed to accurately capture the will of 
Sarasota’s voters [2, 14, 20, 25, 34, 36, 37, 50]. 

While both security flaws and software bugs have re- 
ceived significant attention, a related issue has also ap- 
peared numerous times in real elections using DREs: op- 
erational errors and mistakes. In a 2006 primary election 
in Webb County, Texas—the county’s first use of ES&S 
iVotronic DRE systems—a number of anomalies were dis- 
covered when, as in Sarasota, a close election led to le- 
gal challenges to the outcome [46]. Test votes were acci- 
dentally counted in the final vote tallies, and some ma- 
chines were found to have been “cleared” on election 
day, possibly erasing votes. More recently, in the January
2008 Republican presidential primary in South Carolina,
several ES&S iVotronic systems were incorrectly
configured subsequent to pre-election testing, resulting in 
those machines being inoperable during the actual elec- 
tion. “Emergency” paper ballots ran out in many precincts 
and some voters were told to come back later [11]. 

All of these real-world experiences, in conjunction with 
recent highly critical academic studies, have prompted 
a strong backlash against DRE voting systems or even 
against the use of computers in any capacity in an elec- 
tion. However, computers are clearly beneficial. 

Clearly, computers cannot be trusted to be free of tam- 
pering or bugs, nor can poll workers and election officials 
be guaranteed to always operate special-purpose comput- 
erized voting systems as they were intended to be used. 
Our challenge, then, is to reap the benefits that computers 
can offer to the voting process without being a prisoner to 
their costs. 


2.2 Toward software independence 


Recently, the notion of software independence has been 
put forth by Rivest and other researchers seeking a way 
out of this morass: 


A voting system is software-independent if an 
undetected change or error in its software can- 
not cause an undetectable change or error in an 
election outcome. [41] 


Such a system produces results that are verifiably cor- 
rect or incorrect irrespective of the system’s implementa- 
tion details; any software error, whether malicious or be- 
nign, cannot yield an erroneous output masquerading as a 
legitimate cast ballot. 

Conventionally, the only way to achieve true software 
independence is to allow the voter to directly inspect, and 
therefore confirm to be correct, the actual cast vote record. 
Since we cannot give voters the ability to read bits off 
a flash memory card, nor can we expect them to men- 
tally perform cryptographic computations, we are limited 
in practice to paper-based vote records, which can be di- 
rectly inspected. 

Optical-scan voting systems, in which the voter marks 
a piece of paper that is both read immediately by an elec- 
tronic reader/tabulator and reserved in case of a manual 
audit, achieve this goal at the cost of sacrificing some 
of the accessibility and feedback afforded by DREs. The 
voter-verifiable paper audit trail (VVPAT) allows a DRE to 
create a paper record for the voter’s inspection and for 
use in an audit, but it has its own problems. Adding print- 
ers to every voting station dramatically increases the me- 
chanical complexity, maintenance burden, and failure rate 
of those machines. A report on election problems in the
2006 primary in Cuyahoga County, Ohio found that 9.6% 
of VVPAT records were destroyed, blank, or “compromised 
in some way” [23, p. 93]. 

Even if the voter’s intent survives the printing process, 
the rolls of thermal paper used by many current VVPAT 
printers are difficult to audit by hand quickly and accu- 
rately [22]. It is also unclear whether voters, having al- 
ready interacted with the DRE and confirmed their choices 
there, will diligently validate an additional paper record. 
(In the same Cuyahoga primary election, a different re- 
port found that voters in fact did not know they were sup- 
posed to open a panel and examine the printed tape under- 
neath [1, p.50].) 


2.2.1 Reducing the trusted computing base 


While the goal of complete software independence is 
daunting, the state of the art in voting research approaches 
it by drawing a line around the set of functions that are 
essential to the correctness of the vote and aggressively 
evicting implementation from that set. If assurance can 
come from reviewing and auditing voting software, then it 
should be easier to review and ultimately gain confidence 
in a smaller software stack. 

Pre-rendered user interface (PRUI) is an approach to re- 
ducing the amount of voting software that must be re- 
viewed and trusted [53]. Exemplified by Pvote [52], a 
PRUI system consists of a ballot definition and a software 
system to present that ballot. The ballot definition com- 
prises a state machine and a set of static bitmap images 
corresponding to those states; it represents what the voter 
will see and interact with. The software used in the vot- 
ing machine acts as a virtual machine for this ballot “pro- 
gram.” It transitions between states and sends bitmaps to 
the display device based on the voter’s input (e.g., touch- 
screen or keypad). The voting VM is no longer responsi- 
ble for text rendering or layout of user interface elements; 
these tasks are accomplished long in advance of election 
day when the ballot is defined by election officials. 
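
The division of labor described above can be illustrated with a very small sketch. It is not Pvote's code or file format: the class names, the dictionary-based ballot definition, and the `display` callback are all hypothetical. The point is only that the runtime does nothing beyond looking up a transition and handing an already-rendered image to the display.

```python
from dataclasses import dataclass


@dataclass
class BallotDefinition:
    """Prepared long before election day and auditable on its own."""
    images: dict        # state name -> pre-rendered bitmap (bytes or file name)
    transitions: dict   # (state name, input) -> next state name
    start: str


class BallotVM:
    """Runs a ballot definition; performs no text rendering or layout."""

    def __init__(self, ballot, display):
        self.ballot = ballot
        self.display = display          # callable that blits a bitmap
        self.state = ballot.start
        self.display(ballot.images[self.state])

    def handle(self, user_input):
        """Follow the transition for this input, if the ballot defines one."""
        next_state = self.ballot.transitions.get((self.state, user_input))
        if next_state is not None:
            self.state = next_state
            self.display(self.ballot.images[self.state])
        return self.state
```

Inputs with no defined transition are ignored, so the set of screens a voter can ever see is exactly the set of images in the ballot definition.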

A ballot definition of this sort can be audited for cor- 
rectness independently of the voting machine software 
or the ballot preparation software. Even auditors with- 
out knowledge of a programming language can follow the 
state transitions and proofread the ballot text (already ren- 
dered into pixels). The voting machine VM should still be 
examined by software experts, but this code—critical to 
capturing the user’s intent—is reduced in size and there- 
fore easier to audit. Pvote comprises just 460 lines of 
Python code, which (even including the Python interpreter 
and graphics libraries) compares favorably against current 
DREs: the AccuVote TS involves over 31,000 lines of C++
running atop Windows CE [52]. The system we describe 
in Section 3 applies the PRUI technique to reduce its own
code footprint. 

Sastry et al. [47] describe a system in which program 
modules that must be trusted are forced to be small and 
clearly compartmentalized by dedicating a separate com- 
puter to each. The modules operate on isolated CPUs and 
memory, and are connected with wires that may be ob- 
served directly; each module may therefore be analyzed 
and audited independently without concern that they may 
collude using side channels. Additionally, the modules 
may be powered off and on between voters to eliminate 
the possibility of state leaking from voter to voter. (Sec- 
tion 4.1 shows how we approximate this idea in software.) 


2.2.2 The importance of audit logs 


Even trustworthy software can be misused, and this prob- 
lem occurs with unfortunate regularity in the context of 
electronic voting. We expect administrators to correctly 
deploy, operate, and maintain large installations of unfa- 
miliar computer systems. DRE vendors offer training and 
assistance, but on election day there is typically very little 
time to wait for technical support while voters queue up. 

In fact, the number of operational and procedural errors that
can (and do) occur during elections is quite large. Machines
unexpectedly lose power, paper records are misplaced, 
hardware clocks are set wrong, and test votes (see §2.2.3 
below) are mingled with real ballots. Sufficient trauma to 
a DRE may result in the loss of its stored votes. 

In the event of an audit or recount, comprehensive 
records of the events of election day are essential to estab- 
lishing (or eroding) confidence in the results despite these 
kinds of election-day mishaps. Many DREs keep elec- 
tronic audit logs, tracking election day events such as “the 
polls were opened” and “a ballot was cast,” that would
ideally provide this sort of evidence to post facto auditing 
efforts. Unfortunately, current DREs entrust each machine 
with its own audit logs, making them no safer from failure 
or accidental erasure than the votes themselves. Similarly, 
the audit logs kept by current DREs offer no integrity safe- 
guards and are entirely vulnerable to attack; any malicious 
party with access to the voting machine can trivially alter 
the log data to cover up any misdeeds. 

The Auditorium [46] system confronts this problem by
using techniques from distributed systems and secure log- 
ging to make audit logs into believable records. All vot- 
ing machines in a polling place are connected in a private 
broadcast network; every election event that would con- 
ventionally be written to a private log is also “announced” 
to every voting machine on the network, each of which 
also logs the event. Each event is bound to its origina- 
tor by a digital signature, and to earlier events from other 
machines via a hash chain. The aggressive replication 
protects against data loss and localized tampering; when
combined with hash chains, the result is a hash mesh [48] 
encompassing every event in the polling place. An at- 
tacker (or an accident) must now successfully compro- 
mise every voting machine in the polling place in order to 
escape detection. (In Section 3 we describe how VoteBox
uses and extends the Auditorium voting protocol.)
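
The idea of binding each event to its originator and to earlier events can be sketched as follows. This is a simplified illustration, not the Auditorium wire format: it models a single machine's copy of the shared log as one linear chain, and it uses HMAC-SHA256 with per-machine keys as a stand-in for the public-key digital signatures the text describes. All names are hypothetical.

```python
import hashlib
import hmac
import json


def entry_hash(entry):
    """Hash of a complete log entry, including its signature."""
    return hashlib.sha256(json.dumps(entry, sort_keys=True).encode()).hexdigest()


def sign(key, body):
    """Stand-in for a digital signature (assumption: HMAC with a machine key)."""
    return hmac.new(key, json.dumps(body, sort_keys=True).encode(),
                    hashlib.sha256).hexdigest()


def announce(log, keys, machine, event):
    """Append an event that points at the hash of the latest logged entry."""
    prev = entry_hash(log[-1]) if log else None
    body = {"machine": machine, "event": event, "prev": prev}
    entry = dict(body, sig=sign(keys[machine], body))
    log.append(entry)
    return entry


def verify(log, keys):
    """Check every signature and every back-pointer in the log."""
    prev = None
    for entry in log:
        body = {k: entry[k] for k in ("machine", "event", "prev")}
        if entry["prev"] != prev:
            return False
        if not hmac.compare_digest(entry["sig"], sign(keys[entry["machine"]], body)):
            return False
        prev = entry_hash(entry)
    return True
```

Because every machine on the network would hold its own copy of such a log, altering or dropping an entry on one machine leaves the other copies, and their back-pointers, as evidence.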


2.2.3 Logic and accuracy testing; parallel testing 


Regrettably, the conventional means by which voting machines
are deemed trustworthy is through testing. Long
before election day, the certification process typically
involves some amount of source code analysis and testing
by “independent testing authorities,” but these processes
have been demonstrably ineffective and insufficient.
Logic and accuracy (L&A) testing is a common
black-box testing technique practiced by elections offi- 
cials, typically in advance of each election. L&A testing 
typically takes the form of a mock election: a number of 
votes are cast for different candidates, and the results are 
tabulated and compared against expected values. The goal 
is to increase confidence in the predictable, correct func- 
tioning of the voting systems on election day. 

Complementary to L&A is parallel testing, performed 
on election day with a small subset of voting machines se- 
lected at random from the pool of “live” voting systems. 
The units under test are sequestered from the others; as 
with L&A testing, realistic votes are cast and tallied. By 
performing these tests on election day with machines that 
would otherwise have gone into service, parallel testing is 
assumed to provide a more accurate picture of the behav- 
ior of other voting machines at the same time. 

The fundamental problem with these tests is that they 
are artificial: the conditions under which the test is per- 
formed are not identical to those of a real voter in a real 
election. It is reasonable to assume that a malicious piece 
of voting software may look for clues indicating a test- 
ing situation (wrong day; too few voters; evenly-spread 
voter choices) and behave correctly only in such cases. A 
software bug may of course have similar behavior, since 
faulty DREs may behave arbitrarily. We must also take 
care that a malicious poll worker cannot signal the testing 
condition to the voting machine using a covert channel 
such as a “secret knock” of user interface choices. 

Given this capacity to “lay low” under test, the problem 
of fooling a voting machine into believing it is operat- 
ing in a live vote-capture environment is paramount [26]. 
Because L&A testing commonly makes explicit use of a 
special code path, parallel testing is the most promising 
scenario. It presents its own unique hazard: if the test 
successfully simulates an election-day environment, any 
votes captured under test will be indistinguishable from 
legitimate ballots cast by real voters, so special care must
be taken to keep these votes from being included in the 
final election tally. 


2.3 Cryptography and e-voting 


Many current DREs attempt to use encryption to protect 
the secrecy and integrity of critical election data; they uni- 
versally fail to do so [6, 8, 24, 32]. Security researchers 
have proposed two broad classes of cryptographic tech- 
niques that go beyond simple encryption of votes (sym- 
metric or public-key) to provide end-to-end guarantees to 
the voter. One line of research has focused on encrypt- 
ing whole ballots and then running them through a series 
of mix-nets [9] that will re-encrypt and randomize ballots 
before they are eventually decrypted (see, e.g., [43, 35]). 
If at least one of the mixes is performed correctly, then 
the anonymity of votes is preserved. This approach has 
the benefit of tolerating ballots of arbitrary content, al- 
lowing its use with unconventional voting methods (e.g., 
preferential or Condorcet voting). However, it requires a 
complex mixing procedure; each stage of the mix must 
be performed by a different party (without mutual shared 
interest) for the scheme to be effective. 

As we will show in Section 3, VoteBox employs
homomorphic encryption [5] in order to keep track of each
vote. A machine will encrypt a one for each candidate (or
issue) the voter votes for and a zero elsewhere. The
homomorphic property allows the encrypted votes for each
candidate to be summed into a single total without being
decrypted. This approach, also used by the Adder [30]
and Civitas [12] Internet e-voting systems, typically
combines the following elements:


Homomorphic Tallying The encryption system allows 
encrypted votes to be added together by a third 
party without knowledge of individual vote plain- 
texts. Many ciphers, including El Gamal public key 
encryption, can be designed to have this property. 
Anyone can verify that the final plaintext totals are 
consistent with the sum of the encrypted votes. 


Non-Interactive Zero Knowledge (NIZK) proofs In any 
voting system, we must ensure that votes are well 
formed. For example, we may want to ensure that 
a voter has made only one selection in a race, or 
that the voter has not voted multiple times for the 
same candidate. With a plain-text ballot containing 
single-bit counters (i.e., 0 or 1 for each choice) this 
is trivial to confirm, but homomorphic counters ob- 
scure the actual counter’s value with encryption. By 
employing NIZKs [7], a machine can include with its 
encrypted votes a proof that each vote is well-formed 
with respect to the ballot design (e.g., at most one
candidate in each race received one vote, while all
other candidates received zero votes). Moreover, the
attached proof is zero-knowledge in the sense that the 
proof reveals no information that might help decrypt 
the encrypted vote. Note that although NIZKs like this 
can prevent a voting machine from grossly stuffing 
ballots, they cannot prevent a voting machine from 
flipping votes from one candidate to another. 


The Bulletin Board A common feature of most crypto- 
graphic voting systems is that all votes are posted for 
all the world to see. Individual voters can then verify 
that their votes appear on the board (e.g., locating a 
hash value or serial number “receipt” from their vot- 
ing session within a posted list of every encrypted 
vote). Any individual can then recompute the homo- 
morphic tally and verify its decryption by the elec- 
tion authority. Any individual could likewise verify 
the NIZKs. 


2.4 Non-cryptographic techniques 


In response to the difficulty of explaining cryptography
to non-experts, and as an intellectual exercise,
cryptographers have designed a number of non-cryptographic
paper-based voting systems that have end-to-end security
properties, including ThreeBallot [39, 40], PunchScan [17],
Scantegrity, and Prêt à Voter [10, 42]. These
systems allow voters to express their vote on paper and 
take home a verifiable receipt. Ballots are complicated 
with multiple layers, scratch-off parts, or other additions 
to the traditional paper voting experience. A full analysis 
of these systems is beyond the scope of this paper. 


3 Design 


We now revisit our design goals from Section 1 and
discuss their implementation in VoteBox, our complete
prototype voting system.


3.1 User interface 


Goals achieved: DRE-like user experience; minimized 
software stack 


A recent study [15] bolsters much anecdotal evidence
suggesting that voters strongly prefer the DRE-style
electronic voting experience to more traditional methods.
Cleaving to the DRE model (itself based on the archetypical
computerized kiosk exemplified by bank machines, airline
check-in kiosks, and the like), VoteBox presents the voter
with a ballot consisting of a sequence of pages: full screens
containing text and graphics. The only interactive
elements of the interface are buttons: rectangular regions of
the screen attached to either navigational behavior (e.g.,
“go to next page”) or selection behavior (“choose
candidate X”). (VoteBox supports button activation via touch
screen and computer mouse, as well as keyboards and
assistive technologies.) An example VoteBox ballot screen
is shown in Figure 1.


This simple interaction model lends itself naturally to 
the pre-rendered user interface, an idea popularized in the 
e-voting context by Yee’s Pvote system [52, 53]. A pre- 
rendered ballot encapsulates both the logical content of 
a ballot (candidates, contests, and so forth) and the en- 
tire visual appearance down to the pixel (including all 
text and graphics). Generating the ballot ahead of time 
allows the voting machine software to perform radically 
fewer functions, as it is no longer required to include any 
code to support text rendering (including character sets, 
Unicode glyphs, anti-aliasing), user interface element lay- 
out (alignment, grids, sizing of elements), or any graphics 
rendering beyond bitmap placement. 

More importantly, the entire voting machine has no
need for any of these functions. The only UI-related
services required by VoteBox are user input capture (in the
form of (x, y) pairs for taps/clicks, or keycodes for other
input devices) and the ability to draw a pixmap at a given
position in the framebuffer. We therefore eliminate the
need for a general-purpose GUI window system,
dramatically reducing the amount of code on the voting machine.


In our pre-rendered design, the ballot consists of a set 
of image files, a configuration file which groups these im- 
age files into pages (and specifies the layout of each page), 
and a configuration file which describes the abstract con- 
tent of the ballot (such as candidates, races, and proposi- 
tions). This effectively reduces the voting machine’s user 
interface runtime to a state machine which behaves as fol- 
lows. Initially, the runtime displays a designated initial 
page (which should contain instructional information and 
navigational components). The voter interacts with this 
page by selecting one of a subset of elements on the page 
which have been designated in the configuration as be- 
ing selectable. Such actions trigger responses in VoteBox, 
including transitions between pages and commitment of 
ballot choices, as specified by the ballot’s configuration 
files. The generality of this approach accommodates
accessibility options beyond touch-screens and visual
feedback; inputs such as physical buttons and sip-and-puff
devices can be used to generate selection and navigation
events (including “advance to next choice”) for VoteBox.
Audio feedback could also be added to VoteBox state
transitions, again following the Pvote example [52].
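
The state machine described above can be sketched in a few
lines. The following Python sketch is illustrative only and is
not taken from the VoteBox source: the names Button,
BallotRuntime and PAGES, the "goto"/"select" action labels, and
the choice to deselect on a second tap are all assumptions made
for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Button:
    """A rectangular screen region bound to one behavior (hypothetical layout)."""
    x: int
    y: int
    w: int
    h: int
    action: str   # "goto" (navigation) or "select" (ballot choice)
    target: str   # page name for "goto", choice ID for "select"

    def contains(self, px: int, py: int) -> bool:
        return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h


@dataclass
class BallotRuntime:
    """Minimal UI state machine: current page plus the set of selected choices."""
    pages: dict                      # page name -> list of Button
    current: str                     # designated initial page
    selected: set = field(default_factory=set)

    def tap(self, px: int, py: int) -> bool:
        """Dispatch one (x, y) input event; return True if a button handled it."""
        for button in self.pages[self.current]:
            if button.contains(px, py):
                if button.action == "goto":
                    self.current = button.target
                elif button.action == "select":
                    # Assumed behavior: a second tap deselects the choice.
                    self.selected ^= {button.target}
                return True
        return False


# A two-page toy ballot: an instruction page and one contest.
PAGES = {
    "intro": [Button(0, 0, 100, 40, "goto", "race1")],
    "race1": [
        Button(0, 0, 100, 40, "select", "alice"),
        Button(0, 50, 100, 40, "select", "bob"),
        Button(0, 100, 100, 40, "goto", "intro"),
    ],
}
```

Note that the runtime never lays out text or widgets; it only
maps input coordinates to regions named in the configuration,
which is the property that keeps the on-machine code small.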


We also built a ballot preparation tool to allow election
administrators to create pre-rendered ballots for VoteBox.
This tool, a graphical Java program, contains the layout
and rendering logic that is omitted from VoteBox. In
addition to clear benefits that come from reducing the
complexity of the voting machine, this also pushes many of
the things that might change from election to election or
from state to state out of the voting machine. For
example, Texas requires a straight-ticket voting option while
California forbids it. With VoteBox, the state-specific
behavior is generated by the ballot preparation tool. This
greatly simplifies the software certification process, as
testing labs would only need to consider a single version
of VoteBox rather than separate versions customized for
each state’s needs. Local groups interested in the election
could then examine the local ballot definitions for
correctness, without needing to trust the ballot preparation tool.


Figure 1: Sample VoteBox page. The voter sees (i); a schematic
for the page is shown in (ii); a subset of the pixmaps used to
produce (i) is shown, along with their corresponding IDs, in (iii).


3.2 Auditorium 


Goals achieved: Defense against data loss; tamper- 
evident audit logs 


The failures described in Section 2 indicate that voting 
machines cannot be trusted to store their own data—or, 
at least, must not be solely trusted with their own data. 
We observe that modern PC equipment is sufficiently in- 
expensive to be used as a platform for e-voting (and note 
that most DREs are in fact special-purpose enclosures and 
extensions on exactly this sort of general-purpose
hardware). VoteBox shares with recent peer-to-peer systems
research the insight that modern PCs are noticeably
overprovisioned for the tasks demanded of them; this is
particularly true for e-voting given the extremely minimal
system requirements of the user interface described in
Section 3.1. Such overpowered equipment has CPU, disk,
memory, and network bandwidth to spare, and VoteBox
puts these to good use addressing the problem of data loss
due to election-day failure.

Our design calls for all VoteBoxes in a polling place
to be joined together in a broadcast network as set forth
in our earlier work on Auditorium [46]. An illustration
of this technique can be found in Figure 2. The polling
place network is not to be routable from the Internet;
indeed, an air gap should exist preventing Internet packets
from reaching any VoteBoxes. We will see in Section 3.3
how data leaving the polling place is essential to our
complete design; such a one-way linkage can be built while
retaining an air gap [27].

Each voting machine on the network broadcasts every
event it would otherwise record in its log. As a result, the
loss of a single VoteBox cannot result in the loss of its
votes, or even its record of other election events. As long
as a single voting machine survives, there will be some
record of the votes cast that day.
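
To make the replication argument concrete, here is a toy
in-memory model (an assumption-laden sketch, not the Auditorium
protocol itself: there is no real network, no signatures, and
the class and function names are invented for illustration).
Every machine appends every broadcast event to its own log, so
any survivor can reproduce the events of a lost machine.

```python
class Machine:
    """A booth or supervisor that keeps its own copy of every broadcast event."""

    def __init__(self, name):
        self.name = name
        self.log = []


class BroadcastNetwork:
    """Stand-in for the polling-place broadcast network."""

    def __init__(self):
        self.machines = []

    def join(self, machine):
        self.machines.append(machine)

    def broadcast(self, sender, event):
        # Every participant, including the sender, records the event.
        for machine in self.machines:
            machine.log.append((sender.name, event))


def surviving_events(machines, origin):
    """Events from `origin` recoverable from whichever machines survived."""
    seen = []
    for machine in machines:
        for sender, event in machine.log:
            if sender == origin and event not in seen:
                seen.append(event)
    return seen
```

For example, if booth-1 broadcasts a ballot and is then
destroyed, calling surviving_events on the remaining machines
still returns that ballot.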


Supervisor console. We can treat broadcast log
messages as communication packets, with the useful side
effect that these communications will be logged by all
participating hosts. VoteBox utilizes this feature of
Auditorium to separate machine behavior into two categories: (1)
features an election official would need to use, and (2)
features a voter would need to use. This dichotomy directly
motivates our division of VoteBox into two software
artifacts: (1) the VoteBox “booth” (that is, the voting
machine component that the voter interacts with, as described
in Section 3.1), and (2) the “supervisor” console.


The supervisor is responsible for the coordination of
all election-day events. This includes opening the polls,
closing the polls, and authorizing a vote to be captured at
a booth location. For more practical reasons (because the
supervisor console should run on a machine in the polling
place that only election officials have physical access to,
and, likewise, because election officials should never need
to touch any other machine in the polling place once the
election is running), this console also reports the status of
every other machine in the polling place (including not
only connectivity status, but also various “vital sign”
information, such as its battery power). During the course
of an election day, poll workers are able to conduct the
election entirely from the supervisor console.


[Figure 2 diagram. Supervisor: monitors and displays booth
status; broadcasts vote authorization; records all broadcast
messages. Booths: listen for vote authorizations; capture voter
selections; broadcast encrypted votes; are stateless and
swappable at any time; record all broadcast messages. Backup
supervisor: ready to assume the supervisor’s responsibilities
at any time; records all broadcast messages.]


Figure 2: Voting in the Auditorium. VoteBoxes are connected in a broadcast network. All election events (including cast ballots) are
replicated to every voting machine and entangled with hash chaining. A supervisor console allows poll workers to use the Auditorium
channel to distribute instructions to voting machines (such as “you are authorized to cast a ballot”) such that those commands also enter
the permanent, tamper-evident record.


In addition, as an intended design decision, the
separation of election control (on the supervisor console) from
voting (at the VoteBox booth) fundamentally requires that
every important election event be a network
communication. Because we only allow this communication to
happen in the form of Auditorium broadcast messages, these
communications are always logged by every participating
VoteBox host (supervisors and booths included).


Hash chaining and tamper evidence. Auditorium also
provides for hash chaining of log entries; when combined
with broadcast replication, the result is a lattice of hash
values that entangles the timelines of individual voting
machines. This technique, adapted from the field of
secure audit logging [33, 48], yields strong evidence of
tampering or otherwise omitted or modified records. No
attacker or failure can alter any individual log entry
without invalidating all subsequent hashes in the record. We
prevent attackers from performing this attack before or
after the election by bookending the secure log:
before the polls open, a nonce (or “launch code”) is
distributed, perhaps by telephone, to each polling place; this
nonce is inserted into the beginning of the log.
Similarly, when the polls are closed, election supervisors can
quickly publish the hash of the completed log to prevent
future tampering.


3.3 Cast ballots and immediate ballot challenge


Goals achieved: End-to-end verifiability


In VoteBox, cast ballots are published in the global
Auditorium log, implicitly revealing the contents of the cast
ballot to any party privy to the log data. This, of course,
includes post-election auditors seeking to verify the
validity and accuracy of the result, but it also could include
partisans seeking proof of a bribed voter’s choice (or some
other sort of malicious activity). In fact, the contents of
the cast ballot need to be encrypted (in order to preserve
anonymity), but they also need to fit into a larger
software-independent design. That is, if the software (because of
bugs or malice) corrupts a ballot before encrypting it, this
corruption must be evident to the voter.

An end-to-end verifiable voting system is defined as 
one that can prove to the voter that (1) her vote was cast 
as intended and (2) her vote was counted as cast. Our de- 
sign provides a challenge mechanism, which can verify 
the first property, along with real-time public dissemina- 
tion of encrypted votes, which can satisfy the second. 


Counters. We begin by encoding a cast ballot as an
n-tuple of integers, each of which can be 1 or 0. Each
element of the n-tuple represents a single choice a voter can
make, n is the number of choices, and a value of 1 encodes
a vote for the choice while 0 encodes a vote against the
choice. (In the case of propositions, “yes” and “no”
each appear as a single “choice,” and in the case of
candidates, each candidate is a single “choice.”) The cast ballot
structure need not be organized into races or contests;
it is simply an opaque list of choice values. We define
each element as an integer (rather than a bit) so that
ballots can be homomorphically combined. That is, ballots
$A = (a_0, a_1, \ldots)$ and $B = (b_0, b_1, \ldots)$ can be
summed together to produce a third ballot
$S = (a_0 + b_0, a_1 + b_1, \ldots)$,
whose elements are the total number of votes for each
choice.
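
Before any encryption is involved, the counter encoding can be
illustrated on plaintext tuples. This is a minimal sketch with
hypothetical helper names (add_ballots, tally); it shows only the
element-wise sum that the homomorphic scheme later performs on
ciphertexts.

```python
def add_ballots(a, b):
    """Element-wise sum of two ballots encoded as equal-length integer tuples."""
    if len(a) != len(b):
        raise ValueError("ballots must have the same number of choices")
    return tuple(x + y for x, y in zip(a, b))


def tally(ballots):
    """Sum any number of ballots; each element becomes a per-choice total."""
    ballots = list(ballots)
    if not ballots:
        return ()
    total = (0,) * len(ballots[0])
    for ballot in ballots:
        total = add_ballots(total, ballot)
    return total
```

For instance, tally([(1, 0), (1, 0), (0, 1)]) yields (2, 1): two
votes for the first choice and one for the second.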


Homomorphic encryption of counters. VoteBox uses
an El Gamal variant that is additively homomorphic to
encrypt ballots before they are cast. Each element of the
tuple is independently encrypted. The encryption and
decryption functions are defined as follows:

\[
\begin{aligned}
E(c, r, g^a) &= \langle g^r,\ (g^a)^r f^c \rangle \\
D(\langle g^r,\ g^{ar} f^c \rangle, a) &= \frac{g^{ar} f^c}{(g^r)^a} \\
D(\langle g^r,\ g^{ar} f^c \rangle, r) &= \frac{g^{ar} f^c}{(g^a)^r}
\end{aligned}
\tag{1}
\]

where $f$ and $g$ are group generators, $c$ is the plaintext
counter, $r$ is randomly generated at encryption time, $a$ is
the decryption key, and $g^a$ is the public encryption key.
To decrypt, a party needs either $a$ or $r$ in order to
construct $g^{ar}$. ($g^r$, which is given as the first element of the
cipher tuple, can be raised to $a$, or $g^a$, which is the public
encryption key, can be raised to $r$.) After constructing $g^{ar}$,
the decrypting party should divide the second element of
the cipher tuple by this value, resulting in $f^c$.

To recover the counter’s actual value $c$, we must compute
the discrete logarithm of $f^c$, which of course is difficult
in general. As is conventional in such a situation, we
accelerate this task by precomputing a reverse mapping of
$f^x \rightarrow x$ for $0 \le x \le M$ (for some large $M$)
so that for expected integral values of $c$ the search takes
constant time. (We fall back to a linear search, starting at
$M + 1$, if $c$ is not in the table.)
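
The table lookup with linear fallback might look like the
following sketch. The modulus P = 2039 and generator F = 9 are
toy values chosen for this example (9 generates the order-1019
subgroup of squares modulo 2039); they are assumptions for
illustration and far too small for real use. The `limit`
parameter bounding the fallback search is likewise our own
addition.

```python
P = 2039   # toy prime modulus (assumption); real systems use far larger groups
F = 9      # toy generator of the order-1019 subgroup of squares mod P


def build_table(m):
    """Precompute the reverse mapping f^x -> x for 0 <= x <= m."""
    table = {}
    value = 1
    for x in range(m + 1):
        table[value] = x
        value = value * F % P
    return table


def recover_counter(f_c, table, m, limit):
    """Return c such that F^c == f_c (mod P), searching exponents up to `limit`."""
    if f_c in table:
        return table[f_c]              # constant-time hit for expected values
    value = pow(F, m + 1, P)
    for x in range(m + 1, limit + 1):  # linear fallback starting at m + 1
        if value == f_c:
            return x
        value = value * F % P
    raise ValueError("counter exceeds search limit")
```

Because a tally can never exceed the number of ballots cast, M
can be chosen from the expected turnout so that the fallback is
rarely exercised.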

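We now show that our encryption function is additively
homomorphic by showing that when two ciphers are
multiplied, their corresponding counters are added:

\[
\begin{aligned}
E(c_1, r_1) \odot E(c_2, r_2)
  &= \langle g^{r_1},\ g^{a r_1} f^{c_1} \rangle \odot \langle g^{r_2},\ g^{a r_2} f^{c_2} \rangle \\
  &= \langle g^{r_1 + r_2},\ g^{a(r_1 + r_2)} f^{c_1 + c_2} \rangle
\end{aligned}
\]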
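
The encryption, the two decryption paths, and the homomorphic
property can be checked numerically. The sketch below is not the
VoteBox implementation; it uses toy parameters that we chose for
illustration (P = 2039 = 2·1019 + 1, with G = 4 and F = 9 both
generating the order-1019 subgroup of squares), and all function
names are hypothetical.

```python
import secrets

P = 2039   # toy safe prime, P = 2*Q + 1 (assumption, far too small for real use)
Q = 1019   # prime order of the subgroup of squares mod P
G = 4      # toy generators of that subgroup
F = 9


def keygen():
    """Return (a, g^a): decryption key and public encryption key."""
    a = secrets.randbelow(Q - 1) + 1
    return a, pow(G, a, P)


def encrypt(c, r, pub):
    """E(c, r, g^a) = <g^r, (g^a)^r * f^c>."""
    return (pow(G, r, P), pow(pub, r, P) * pow(F, c, P) % P)


def _divide(x, y):
    # Division mod a prime via Fermat's little theorem.
    return x * pow(y, P - 2, P) % P


def decrypt_with_key(cipher, a):
    """Divide out (g^r)^a, leaving f^c."""
    g_r, body = cipher
    return _divide(body, pow(g_r, a, P))


def decrypt_with_r(cipher, r, pub):
    """Divide out (g^a)^r, leaving f^c; needs r but not the key a."""
    _, body = cipher
    return _divide(body, pow(pub, r, P))


def combine(c1, c2):
    """Component-wise product of two ciphers; the counters add."""
    return (c1[0] * c2[0] % P, c1[1] * c2[1] % P)
```

Combining the encryptions of 1, 0, 1 and 1 and decrypting with
the key yields f^3, which a lookup table then maps back to the
tally 3.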


Immediate ballot challenge. To allow the voter to
verify that her ballot was cast as intended, we need some way
to prove to the voter that the encrypted cipher published
in the Auditorium log represents the choices she actually
made. This is, of course, a contentious issue fraught with
negative human factors implications.

We term our solution to the first requirement of
end-to-end verifiability “immediate ballot challenge,” borrowing
an idea from Benaloh [4]. A voter should be able (on
any arbitrary ballot) to challenge the machine to produce
a proof that the ballot was cast as intended. Of course,
because these challenges generally force the voting
machine to reveal information that would compromise the
anonymity of the voter, challenged ballots must be
discarded and not counted in the election. A malicious
voting system now has no knowledge of which ballots will
be challenged, so it must either cast them all correctly or
risk being caught if it misbehaves.


Our implementation of this idea is as follows. Before
a voter has committed to her vote, in most systems, she
is presented with a final confirmation page which offers
two options: (1) go back and change selections, or (2)
commit the vote. Our system, like Benaloh’s, adds one
more page at the end, giving the voter the opportunity to
challenge or cast a vote. At this point, Benaloh prints a
paper commitment to the vote. VoteBox will similarly
encrypt and publish the cast ballot before displaying this
final “challenge or cast” screen. If the voter chooses to cast
her vote, VoteBox simply logs this choice and behaves
as one would expect, but if the voter instead chooses to
challenge VoteBox, it will publish the value for r that it
passed to the encryption function (defined in equation 1)
when it encrypted the ballot in question. Using equation
1 and this provided value of r, any party (including the
voter) can decrypt and verify the contents of the ballot
without knowing the decryption key. An illustration of
this sequence of events is in Figure 3.
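
A challenge verifier’s check can be sketched as follows,
assuming one published r per encrypted counter. The group
parameters are the same kind of toy values used for
illustration elsewhere (assumptions, not VoteBox’s real
parameters), and open_with_r and verify_challenge are
hypothetical names.

```python
P, G, F = 2039, 4, 9   # toy group parameters (assumptions for illustration)


def encrypt(c, r, pub):
    """E(c, r, g^a) = <g^r, (g^a)^r * f^c>, as in equation 1."""
    return (pow(G, r, P), pow(pub, r, P) * pow(F, c, P) % P)


def open_with_r(cipher, r, pub):
    """Recover a 0/1 counter from a committed cipher using the revealed r."""
    g_r, body = cipher
    if pow(G, r, P) != g_r:
        raise ValueError("revealed r does not match the commitment")
    # Divide the second element by (g^a)^r to obtain f^c.
    f_c = body * pow(pow(pub, r, P), P - 2, P) % P
    for c in (0, 1):
        if pow(F, c, P) == f_c:
            return c
    raise ValueError("counter is neither 0 nor 1")


def verify_challenge(committed, revealed_rs, pub, expected):
    """True if the committed ballot opens to the choices the challenger made."""
    opened = [open_with_r(ct, r, pub) for ct, r in zip(committed, revealed_rs)]
    return opened == list(expected)
```

A machine that committed to (1, 0, 0) when the challenger
selected (0, 1, 0) fails this check, and the verifier never
needs the decryption key a.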


In order to make this process immediate, we need a way
for voters (or voter advocates) to safely observe
Auditorium traffic and capture their own copy of the log. It is
only then that the voter will be able to check, in real time,
that VoteBox recorded and encrypted her preferences
correctly. To do this, we propose that the local network
constructed at the polling place be connected to the public
Internet via a data diode [27], a physical device which will
guarantee that the information flow is one way. This
connectivity will allow any interested party to watch the
polling location’s Auditorium traffic in real time. In fact,
any party could provide a web interface, suitable for
access via smart phones, that could be used to see the voting
challenges and perform the necessary cryptography. This
arrangement is summarized in Figure 4. Additionally, on
the output side of the data diode, we could provide a
standard Ethernet hub, allowing challengers to locally plug in
their own auditing equipment without relying on the
election authority’s network infrastructure. Because all
Auditorium messages are digitally signed, there is no risk of
the challenger being able to forge these messages.


Figure 3: Challenge flow chart. As the voter advances past the review screen to the final confirmation screen, VoteBox commits to the
state of the ballot by encrypting and publishing it. A challenger, having received this commitment (the encrypted ballot) out-of-band (see
Figure 4), can now invoke the “challenge” function on the VoteBox, compelling it to reveal the contents of the same encrypted ballot.
(A voter will instead simply choose “cast”.)


Figure 4: Voting with ballot challenges. The polling place from Figure 2 sends a copy of all log data over a one-way channel to
election headquarters (not shown) which aggregates this data from many different precincts and republishes it. This enables third-party
“challenge centers” to provide challenge verification services to the field.


Implications of the challenge scheme. Many states
have laws against connecting voting machines or
tabulation equipment to the Internet (a good idea, given the
known security flaws in present equipment). Our
cryptographic techniques, combined with the data diode to
preserve data within the precinct, offer some mitigation
against the risks of corruption in the tallying
infrastructure. An observer could certainly measure the voting
volume of every precinct in real time. This is not generally
considered to be private information.


VoteBox systems do not need a printer on every voting
machine; however, Benaloh’s printed ballot commitments
offer one possibly valuable benefit: they allow any voter
to take the printout home, punch the serial number into
a web site, and verify the specific ballot ciphertext that
belongs to them is part of the final tally, thus improving
voters’ confidence that their votes were counted as cast. A
VoteBox lacking this printer cannot offer voters this
opportunity to verify the presence of their own cast ballot
ciphertexts. Challengers, of course, can verify that the
ciphertexts are correctly encrypted and present in the log in
real time, thus increasing the confidence of normal
voters that their votes are likewise present to be counted as
cast. Optionally, Benaloh’s printer mechanism could be
added to VoteBox, allowing voters to take home a printed
receipt specifying the ciphertext of their ballot.

Similarly, VoteBox systems do not need NIZKs. While
NIZKs impose limits on the extent to which a malicious
VoteBox can corrupt the election tallies by corrupting
individual votes, this sort of misbehavior can be detected
through our challenge mechanism. Regardless, NIZKs
would integrate easily with our system and would provide
an important “sanity checking” function that can apply to
every ballot, rather than only the challenged ballots.


3.4 Procedures


To summarize the VoteBox design, let us review the steps
involved in conducting an election with the system.


Before the election. 


1. The ballot preparation software is used to create the 
necessary ballot definitions. 

2. Ballot definitions are independently reviewed for 
correctness (so that the ballot preparation software 
need not be trusted). 

3. Ballot definitions and key material (for vote
encryption) are distributed to polling places along with
VoteBox equipment.


Election day: opening the polls. 


4. The Auditorium network is established and
connected to the outside world through a data diode.

5. All supervisor consoles are powered on, connected to
the Auditorium network, and one of them is enabled
as the primary console (others are present for failover
purposes).

6. Booth machines are powered on and connected to the
Auditorium network.

7. A “launch code” is distributed to the polling place by 
the election administrator. 

8. Poll workers open the polls by entering the launch 
code. 


The last step results in a “polls-open” Auditorium
message, which includes the launch code. All subsequent
events that occur will, by virtue of hash chaining,
provably have occurred after this “polls-open” message, which
in turn means they will have provably occurred on or after
election day.
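
The ordering argument relies on the hash chaining and
bookending described in Section 3.2. As a simplified sketch,
consider a single linear chain (the real Auditorium log is a
lattice across machines and its messages are signed, neither of
which is modeled here). The message strings, the all-zero
starting value, and the use of SHA-256 are assumptions made for
this example.

```python
import hashlib

GENESIS = b"\x00" * 32   # assumed fixed value preceding the first entry


def entry_hash(prev_hash, message):
    """Hash of an entry covers the previous entry's hash and this message."""
    return hashlib.sha256(prev_hash + message).digest()


def append(log, message):
    """Append (message, hash) where the hash chains to the previous entry."""
    prev = log[-1][1] if log else GENESIS
    log.append((message, entry_hash(prev, message)))


def verify(log, launch_code, closing_hash):
    """Check the chain, the launch-code bookend, and the published final hash."""
    prev = GENESIS
    for message, digest in log:
        if entry_hash(prev, message) != digest:
            return False          # an entry was altered after the fact
        prev = digest
    starts_correctly = bool(log) and log[0][0] == b"polls-open:" + launch_code
    return starts_correctly and prev == closing_hash
```

Altering any entry breaks every later hash; recomputing the
whole chain instead changes the final hash, which no longer
matches the value published when the polls closed.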


Election day: casting votes. 


9. The poll worker interacts with the supervisor
console to enable a booth for the voter to use. This
includes selecting a machine designated as not in use
and pressing an “authorize” button.

10. The supervisor console broadcasts an authorization
message directing the selected machine to interact
with a voter, capture his preference, and broadcast
back the result.

11. If the booth does not have a copy of the ballot
definition mentioned in the authorization message, it
requests that the supervisor console publish the ballot
definition in a broadcast.

12. The booth graphically presents the ballot to the voter
and interacts with her, capturing her choices.

13. The booth shows a review screen, listing the voter’s
choices.

14. If the voter needs to make changes, she can do that
by navigating backward through the ballot screens.
Otherwise, she indicates she is satisfied with her
selections.

15. The booth publishes the encrypted ballot over the
network, thereby committing to its contents. The
voter may now choose one of two paths to complete
her voting session:


Cast her vote by pressing a physical button. The
VoteBox signals to the voter that she may exit the
booth area; it also publishes a message declaring that
the encrypted ballot has been officially cast and can
no longer be challenged.


Challenge the machine by invoking a separate UI
function. The challenged VoteBox must now reveal
proof that the ballot was cast correctly. It does so
by publishing the secret r used to encrypt the
ballot; the ballot is no longer secret. This proof, like all
Auditorium traffic, is relayed to the outside world,
where a challenge verifier can validate against the
earlier commitment and determine whether the
machine was behaving correctly. The voter or poll
workers can contact the challenge verifier out-of-band
(e.g., with a smartphone’s web browser) to
discover the result of this challenge. Finally, the ballot
committed to in step 15 is nullified by the existence
of the proof in the log. The VoteBox resets its state.
The challenge is complete.


Election day: closing the polls. 


16. A poll worker interacts with the supervisor console,
instructing it to close the polls.

17. The supervisor console broadcasts a “polls-closed”
message, which is the final message that needs to go
in the global log. The hash of this message is
summarized on the supervisor console.

18. Poll workers note this value and promptly distribute
it outside the polling place, fixing the end of the
election in time (just as the beginning was fixed by the
launch code).

19. Poll workers are now free to disconnect and power
off VoteBoxes.


3.5 Attacks on the challenge system 


A key design issue we must solve is limiting
communication to voters, while they are voting, that might be used
to coerce them into voting in a particular fashion. If a
voter could see her vote’s ciphertext before deciding to
challenge it, she could be required to cast or challenge
the ballot based on the ciphertext (e.g., challenge if even,
cast if odd). An external observer could then catch her if
she failed to vote as intended. Kelsey et al. [29] describe
a variety of attacks in this fashion. Benaloh solves this
problem by having the paper commitment hidden behind
an opaque shield. We address it by requiring a voter to
state that she intends to perform a challenge prior to
approaching a voting system. At this point, a poll worker
can physically lock the “cast ballot” button and enable the
machine to accept a vote as normal. While the VoteBox
has no idea it is being challenged, the voter (or absolutely
anybody else) can freely use the machine, videotape the
screen, and observe its network behavior. The challenger
cannot, however, cast the ballot.

Consequently, in the common case when voters wish to
cast normal votes, they must not have access to the
Auditorium network stream while voting. This means
cellular phones and other such equipment must be banned to
enforce the privacy of the voter. (Such a ban is already
necessary, in practice, to defeat the use of cellular
telephones to capture video evidence of a vote being cast on
traditional DRE systems.)

A related attack concerns the behavior of a VoteBox
once a user has gone beyond the “review selections”
screen to the “cast?” screen (see Figure 3). If the voter
wants to vote for Alice and the machine wants to defraud
Alice, the machine could challenge votes for Alice while
displaying the UI for a regular cast ballot. To address these
phantom challenges, we take advantage of Auditorium.
Challenge messages are broadcast to the entire network
and initiate a suitable alarm on the supervisor console. For
a genuine challenge, the supervisor will be expecting the
alarm. Otherwise, the unexpected alarm would cue a
supervisor to offer the voter a chance to vote again. As a
result, a malicious VoteBox will be unable to surreptitiously
challenge legitimate votes. Rather, if it misbehaved a
sufficient number of times, it would be taken out of service,
limiting the amount of damage it could cause.
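
The supervisor-side bookkeeping for phantom challenges can be
sketched as below. This is our own illustration rather than the
VoteBox supervisor code: the class name, the returned action
labels, and in particular the `retire_after` threshold are
assumptions, since the text says only that a booth would be
retired after “a sufficient number” of unexpected challenges.

```python
from collections import Counter


class ChallengeMonitor:
    """Tracks which challenge broadcasts the poll workers were expecting."""

    def __init__(self, retire_after=3):
        self.expected = set()          # booths with an announced challenger
        self.phantoms = Counter()      # unexpected challenges seen per booth
        self.retire_after = retire_after

    def announce_challenge(self, booth_id):
        """Poll worker records that a challenger is about to use this booth."""
        self.expected.add(booth_id)

    def on_challenge_message(self, booth_id):
        """Handle a broadcast challenge message; return the action to take."""
        if booth_id in self.expected:
            self.expected.discard(booth_id)
            return "expected"
        self.phantoms[booth_id] += 1
        if self.phantoms[booth_id] >= self.retire_after:
            return "retire-booth"
        return "offer-revote"
```

Because every challenge message reaches the supervisor over the
broadcast channel, a booth cannot issue one without it being
counted here.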


4 Discussion 


4.1 Implementation notes and experience 


Development of VoteBox has been underway since May
of 2006; in that time the software has gone through a
number of metamorphoses that we briefly describe here.


Secure software design. When we began the VorEBox 
implementation project, our initial goal was to develop a 
research platform to explore both security and human fac- 
tors aspects of the electronic voting problem. Our early 
security approaches were: (1) reduced trusted code base 
through use of PRUI due to Yee [53]; (2) software simula-
tion of hardware-enforced separation of components after
the example of Sastry et al. [47]; and (3) hardware sup- 
port for strict runtime software configuration control (i.e., 
trusted computing hardware). 

Our original strategy for achieving trustworthy hardware
was to target the Xbox 360 video game platform,⁶
initially developing VoteBox as a Managed C# application.
The Xbox has sophisticated hardware devoted to
ensuring that the system runs only certified software pro- 
grams, which is an obviously useful feature for a DRE. 
Additionally, video game systems are designed to be inex- 
pensive and to withstand some abuse, making them good 
candidates for use in polling places. Finally, a lack of a 
sophisticated operating system is no problem for a pre- 
rendered user interface; we were fairly confident that an 
Xbox could handle displaying static pixmaps. We quickly 
found, however, that development for a more widely- 
available software platform was both easier for us and 
more likely to result in a usable research product. 

By the end of the summer of 2006 we had ported VoteBox
to Java. We had no intention of relying on Java’s AWT 
graphical interface (and its dependency, in turn, on a win- 
dow system such as X or Windows). Instead, we intended 
to develop VoteBox atop SDL, the Simple DirectMedia
Layer,⁷ a dramatically simpler graphics stack. (The Pvote
system also uses SDL as a side-effect of its dependency on 
the Pygame library [52].) Regrettably, the available Java 
bindings for SDL suffered from stability problems, forcing 
us to run our PRUI atop a limited subset of AWT (including 
only blitting and user input events). 

Our intended approach to hardware-inspired software 
module separation was twofold: force all modules to 
interact with one another through observable software 
“wires,” and re-start the Java VM between voters to pre- 
vent any objects lingering from one voting session to the 
next. Both of these ideas are due to Sastry’s example. In 
the end, only the latter survived in our design; VoteBox
essentially “reboots” between voters, but complexity and 
time constraints made our early software wire prototypes 
unworkable. 


Insecure software design. As mentioned above, we intended
from the beginning that VoteBox would serve as
a foundation for e-voting research of different stripes, in- 
cluding human factors studies. This would prove to be 
its earliest test; VoteBox found use in various studies carried
out by Byrne, Everett, and Greene between 2006 and
2008 [15, 16]. Working in close coordination with these 
researchers, we developed ballot designs and tuned the 
VoteBox user experience to meet their research needs.
(The specific graphic design of the ballot shown in Fig-
ure 1 is owed to this collaboration.)

We also modified VoteBox to emit fine-grained data
tracking the user’s every move: the order of visited 
screens, the time taken to make choices, and so forth. This 
sort of functionality would be considered a breach of voter 
privacy in a real voting system, so we took great pains to 
make very clear the portions of the code that were inserted 
for human factors studies. Essential portions of this code 
were sequestered in a separate module that could be left 
out of compilation to ensure that no data collection can 
happen on a “real” VoteBox; later we made this distinc-
tion even more stark by dividing the VoteBox codebase
into two branches in our source control system. 

It is noteworthy that some of the most interesting human
factors results [16, studies 2 and 3] require a malicious
VoteBox. One study measured how likely voters are
to notice if contests are omitted from the review screen; 
another, if votes on the review screen are flipped from the 
voter’s actual selection. If data collection functionality 
accidentally left in a “real” VoteBox is bad, this code is
far worse. We added the word “evil” to the names of the 
relevant classes and methods so that there would be no 
confusion in a code auditing scenario. 


S-expressions. When it came time to develop the
Auditorium network protocol, we chose to use a subset of
the S-expression syntax defined by Rivest [38]. Previous 
experiences with peer-to-peer systems that used the con- 
venient Java ObjectOutputStream for data serialization re- 
sulted in protocols that were awkwardly bound to partic- 
ular implementation details of the code, were difficult to 
debug by observation of data on the wire, and were inex- 
orably bound to Java. 

S-expressions, in particular the canonical representa-
tion used in Auditorium, are a general-purpose, portable
data representation designed for maximum readability 
while at the same time being completely unambiguous. 
They are therefore convenient for debugging while still 
being suitable for data that must be hashed or signed. By 
contrast, XML requires a myriad of canonicalization algo- 
rithms when used with digital signatures; we were happy 
to leave this large suite of functionality out of VoteBox.
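
To illustrate why a canonical form needs no separate canonicalization step, the following Java sketch encodes a nested expression in the length-prefixed canonical style of Rivest's S-expression proposal, where every atom is written as its byte length, a colon, and its bytes. The class name and the sample message are hypothetical; they are not taken from the Auditorium code.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

// Minimal canonical S-expression encoder (illustrative only).
class CanonicalSexp {
    // An expression is either a String (an atom) or a List of expressions.
    static String encode(Object expr) {
        if (expr instanceof String) {
            String atom = (String) expr;
            // Prefix each atom with its length in bytes, then a colon.
            int length = atom.getBytes(StandardCharsets.UTF_8).length;
            return length + ":" + atom;
        }
        if (expr instanceof List) {
            StringBuilder sb = new StringBuilder("(");
            for (Object child : (List<?>) expr) {
                sb.append(encode(child));
            }
            return sb.append(")").toString();
        }
        throw new IllegalArgumentException("not an S-expression: " + expr);
    }
}
```

For example, encoding the list `("cast-ballot" ("nonce" "42"))` yields `(11:cast-ballot(5:nonce2:42))`. There is exactly one such encoding for a given expression, so the bytes can be hashed or signed directly.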

We quickly found S-exps to be convenient for other por-
tions of VoteBox. They form the disk format for our se-
cure logs (unsurprising, since the logs are carbon copies of
network traffic). Pattern matching and match capture, which
we added to our S-exp library initially to facilitate parsing
of Auditorium messages, subsequently found heavy use
at the core of Querifier [44], our secure log constraints
checker, allowing its rule syntax to be naturally expressed
as S-exps. Even the human factors branch of VoteBox
dumps user behavior data in S-expressions.


module         semicolons   stripped LOC
sexpression          1170           2331
auditorium           1618           3440
supervisor            959           1525
votebox              3629           7339
total                7376          14635


Table 1: Size of the VoteBox trusted codebase. Semicolons
refers to the number of lines containing at least one ‘;’ char- 
acter and is an approximation of the number of statements 
in the code. Stripped LOC refers to the number of non- 
whitespace, non-comment lines of code. The difference is a 
crude indicator of the additional syntactic overhead of Java. 
Note that the ballot preparation tool is not considered part 
of the TCB, since it generates ballots that should be audited 
directly; it is 4029 semicolons (6657 stripped lines) of Java 
code using AWT/Swing graphics. 


Code size. Table 1 lists several code size metrics for 
the modules in VoteBox, including all unit tests. We as-
pired to the compactness of Pvote’s 460 Python source
lines [52], but the expanded functionality of our system, 
combined with the verbosity of Java (especially when 
written in clear, modern object-oriented style) resulted in 
a much larger code base. The votebox module (anal- 
ogous to Pvote’s functionality) contains nearly twenty 
times as many lines of code. The complete VoteBox code-
base, however, compares quite favorably with current DRE
systems, making thorough inspection of the source code a 
tractable proposition. 


4.2 Performance evaluation and estimates 


By building a prototype implementation of our design, we 
are able to validate that it operates within reasonable time 
and space bounds. Some aspects of VoteBox require “real
time” operation while others can safely take minutes or 
hours to complete. 


Log publication. Recall that VoteBoxes, because they
communicate with one another using the Auditorium
protocol, produce S-expression log data that records the
events that happened during the election. An important
design goal is to let outside parties see this log data in real
time; our immediate ballot challenge protocol relies on it.

We’ve assumed, as a worst case, that the polling place 
is connected to election central with a traditional modem. 
This practical bandwidth limitation forces us to explore 
the size of the relevant log messages and examine their 
impact on the time it takes to perform an immediate ballot 
challenge. This problem is only relevant if the verifica- 
tion machine is not placed on the polling place network 
(on the public side of the data diode). With the verifica-
tion machine on the LAN, standard network technology
will be able to transmit the log data much faster than any
reasonable polling place could generate it. 

A single voter’s interaction with the polling place re-
sults in the following messages: (1) an authorization
message from the supervisor to the booth shortly after the
voter enters the polling place; (2) a commitment message
broadcast by the booth after the voter is done voting; (3)
either a cast ballot message or a challenge response mes-
sage (the former if the voter decides to cast and the latter
if the voter decides to challenge); and (4) an acknowledg-
ment from the supervisor that the cast ballot or challenge
has been received, which effectively allows the machine
to release its state and wait for the next authorization.

Assuming all the crypto keys are 1024 bits long, an
authorization-to-cast message is 1 KB. Assuming 30 se-
lectable elements are on the ballot, both commit and cast
messages are 13 KB while challenge response messages
are 7 KB. An acknowledgment is 1 KB.

We expect a good modem’s throughput to be 
5 KB/second. The challenger must ask the machine to 
commit to a vote, wait for the verification host to receive 
the commitment, then ask the machine to challenge the 
vote. (The voter must wait for proof of the booth’s com- 
mitment in order for the protocol to work.) In the best 
case, when only one voter is in the polling place (and 
the uploader’s buffer is empty), a commitment can be im- 
mediately transmitted. This takes under 3 seconds. The 
challenge response can be transmitted in under 2 seconds. 
In the worst case, when as many as 19 other voters have 
asked their respective booths to commit and cast their bal- 
lots, the challenger must wait for approximately 494 KB 
of data to be uploaded (on behalf of the other voters). 
This would take approximately 100 seconds. Assuming 
19 additional voters, in this short time, were given access 
to booths and all completed their ballots, the challenger 
might be forced to wait another 100 seconds before the 
challenge response (the list of r-values used to encrypt 
the first commitment) could make it through the queue. 

Therefore, in the absolute worst case situation (30 ele- 
ments on the ballot and 20 machines in the polling place), 
the challenger is delayed by a maximum of 200 seconds 
due to bandwidth limitations. 
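
The delay figures above follow from simple arithmetic, which the Java sketch below reproduces using only the message sizes and modem throughput stated in the text. The class and method names are ours; the backlog formula assumes, as the text does, that each of the other voters contributes one commit and one cast message.

```java
// Reproduces the bandwidth arithmetic for the immediate ballot challenge.
class ChallengeDelay {
    static final double MODEM_KB_PER_SEC = 5.0;   // assumed modem throughput
    static final int COMMIT_KB = 13;               // commit message, 30-element ballot
    static final int CAST_KB = 13;                 // cast message, 30-element ballot
    static final int CHALLENGE_RESPONSE_KB = 7;    // challenge response message

    // Time to upload a given amount of data over the modem.
    static double seconds(double kilobytes) {
        return kilobytes / MODEM_KB_PER_SEC;
    }

    // Data queued ahead of the challenger when `others` voters each commit and cast.
    static int backlogKb(int others) {
        return others * (COMMIT_KB + CAST_KB);
    }

    public static void main(String[] args) {
        System.out.println(seconds(COMMIT_KB));              // 2.6 s: best-case commitment
        System.out.println(seconds(CHALLENGE_RESPONSE_KB));  // 1.4 s: best-case challenge response
        System.out.println(backlogKb(19));                   // 494 KB queued by 19 other voters
        System.out.println(seconds(backlogKb(19)));          // 98.8 s: roughly 100 seconds per wait
    }
}
```

Two such waits, one before the commitment and one before the challenge response, give the bound of roughly 200 seconds.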


Encryption. Because a commitment is an encrypted 
version of the cast ballot, a cast ballot must be encrypted 
before a commitment to it is published. Furthermore, the 
verifier must do a decryption in order to verify the result 
of a challenge. Encryption and decryption are always a
potential source of delay, so we examine our im-
plementation’s encryption performance here.

Recall that a cast ballot is an n-tuple of integers, and an 
encrypted cast ballot has each of these integers encrypted
using our additively homomorphic El Gamal encryption
function. We benchmarked the encryption of a reference
30-candidate ballot; on a Pentium M 1.8 GHz laptop it
took 10.29 CPU seconds, and on an Opteron 2.6 GHz 
server it took 2.34 CPU seconds. We also benchmarked 
the decryption, using the r-values generated by the en- 
cryption function (simulating the work of a verification 
machine in the immediate ballot challenge protocol). On 
the laptop, this decryption took 5.18 CPU seconds, and on 
the server it took 1.27 CPU seconds. 

The runtime of this encryption and decryption will be 
roughly the same. However, there is one caveat. To make 
our encryption function additively homomorphic, we ex- 
ponentiate a group member (called f in equation 1) by the 
plaintext counter (called c in equation 1). (The result is 
that when this value is multiplied, the original counter gets 
added “in the exponent.”) Because discrete log is a hard 
problem, this exponentiation cannot be reversed. Instead, 
our implementation stores a precomputed table of encryp-
tions of low counter values. We assume that, in real elec-
tions, these counters will never be above some reasonable
threshold (we chose 20,000). Supporting counters larger 
than our precomputed table would require a very expen- 
sive search for the proper value. 
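
The role of the precomputed table can be seen in a toy version of the scheme. The Java sketch below assumes the common exponential El Gamal form, with ciphertext (g^r, h^r · f^c) for public key h = g^x; equation 1 is not reproduced in this section, so that exact form is an assumption. The parameters are deliberately small and insecure, the table holds only 51 entries rather than 20,000, and all names are hypothetical.

```java
import java.math.BigInteger;
import java.security.SecureRandom;
import java.util.HashMap;
import java.util.Map;

// Toy additively homomorphic ("exponential") El Gamal. NOT secure parameters.
class ExpElGamal {
    static final BigInteger P = BigInteger.ONE.shiftLeft(127).subtract(BigInteger.ONE); // prime 2^127 - 1
    static final BigInteger G = BigInteger.valueOf(3);   // arbitrary base, for illustration
    static final BigInteger F = BigInteger.valueOf(5);   // group member raised to the counter
    static final int TABLE_LIMIT = 50;                   // the text uses a threshold of 20,000
    static final Map<BigInteger, Integer> TABLE = new HashMap<>();
    static {
        // Precompute f^c -> c so decryption never has to solve a discrete log.
        for (int c = 0; c <= TABLE_LIMIT; c++) {
            TABLE.put(F.modPow(BigInteger.valueOf(c), P), c);
        }
    }
    static final SecureRandom RNG = new SecureRandom();

    // Encrypt a counter under public key h = g^x.
    static BigInteger[] encrypt(int counter, BigInteger h) {
        BigInteger r = new BigInteger(120, RNG);
        BigInteger fc = F.modPow(BigInteger.valueOf(counter), P);
        return new BigInteger[] { G.modPow(r, P), h.modPow(r, P).multiply(fc).mod(P) };
    }

    // Multiplying ciphertexts component-wise adds the counters "in the exponent".
    static BigInteger[] add(BigInteger[] a, BigInteger[] b) {
        return new BigInteger[] { a[0].multiply(b[0]).mod(P), a[1].multiply(b[1]).mod(P) };
    }

    // Recover f^c using private key x, then look the counter up in the table.
    static int decrypt(BigInteger[] ct, BigInteger x) {
        BigInteger fc = ct[1].multiply(ct[0].modPow(x, P).modInverse(P)).mod(P);
        Integer counter = TABLE.get(fc);
        if (counter == null) {
            throw new ArithmeticException("counter exceeds precomputed table");
        }
        return counter;
    }
}
```

Summing seven encryptions of 1 with `add` and decrypting the result returns 7; a sum above `TABLE_LIMIT` fails the lookup, which is where the expensive search mentioned above would otherwise be needed.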

This is never an issue in practice, since individual bal-
lots only ever encrypt the values 0 and 1, and there will
never be more than a few thousand votes per day in a 
given precinct. While there may be a substantially larger 
number of votes across a large city, the election official 
only needs to perform the homomorphic addition and de-
cryption on a precinct-by-precinct basis.⁸ This also allows
election officials to derive per-precinct subtotals, which 
are customarily reported today and are not considered to 
violate voter privacy. Final election-night tallies are com- 
puted by adding the plaintext sums from each precinct. 


Log analysis. There are many properties of the pub- 
lished logs that we might wish to validate, such as ensur- 
ing that all votes were cast while the polls were open, that 
no vote is cast without a prior authorization sharing the 
same nonce, and so on. These properties can be validated 
by hand, but are also amenable to automatic analysis. We 
built a tool called Querifier [44, 45] that performs this
function based on logical predicates expressed over the 
logs. None of these queries need to be validated in real 
time, so performance is less critical, so long as answers 
are available within hours or even days after the election. 
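
As an illustration of the kind of property involved, the Java sketch below checks one of the rules mentioned above: no ballot is cast without a prior authorization carrying the same nonce. It is a hand-written check over a simplified, hypothetical event record, not Querifier's rule syntax, and the message type names are assumptions.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative check of one log property over a simplified event list.
class LogCheck {
    // Hypothetical flattened log record: a message type and its nonce.
    static final class Event {
        final String type;
        final String nonce;

        Event(String type, String nonce) {
            this.type = type;
            this.nonce = nonce;
        }
    }

    // True if every "cast-ballot" event is preceded by an
    // "authorized-to-cast" event with the same nonce.
    static boolean everyCastWasAuthorized(List<Event> log) {
        Set<String> authorized = new HashSet<>();
        for (Event e : log) {
            if (e.type.equals("authorized-to-cast")) {
                authorized.add(e.nonce);
            } else if (e.type.equals("cast-ballot") && !authorized.contains(e.nonce)) {
                return false;
            }
        }
        return true;
    }
}
```

A single pass over the log suffices for this rule, which is consistent with the observation that such checks need not run in real time.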


4.3 Security discussion 


Beyond the security goals introduced in Section 1 and
elaborated in Section 3, we offer a few further explo-
rations of the security properties of our design.


Ballot decryption key material. We have thus far
avoided the topic of which parties are entitled to decrypt 
the finished tally, assuming that there exists a single entity 
(perhaps the director of elections) holding an El Gamal 
private key. We can instead break the decryption key 
up into shares [49, 13] and distribute them to several 
mutually-untrusting individuals, such as representatives 
of each major political party, forcing them to cooperate 
to view the final totals. 

This may be insufficient to accommodate varying le- 
gal requirements. Some jurisdictions require that each 
county, or even each polling place, be able to generate 
its own tallies on the spot once the polls close. In this 
case we must create separate key material for each tal- 
lying party, complicating the matter of who should hold 
the decryption key. Our design frees us to place the de- 
cryption key on, e.g., the supervisor console, or a USB key 
held by a local election administrator. We can also use 
threshold decryption to distribute key shares among mul-
tiple VoteBoxes in the polling place or among mutually-
untrusting individuals present in the polling place.


Randomness. Our El Gamal-based cryptosystem, like 
many others, relies on the generation of random numbers 
as part of the encryption process. Since the ciphertext
includes g^r, a malicious voting machine could perform
O(2^k) computations to encode k bits in g^r, perhaps leak-
ing information about voters’ selections. Karlof et al. [28]
suggest several possible solutions, including the use of
trusted hardware. Verifiable randomness may also be pos-
sible as a network service or a multi-party computation
within the VoteBox network [21].
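
The cost of this covert channel is easy to see in code. The Java sketch below shows one way a malicious machine could choose r: it keeps drawing candidates until the low k bits of g^r spell out the value it wants to leak, which takes about 2^k attempts on average. The group parameters are toy values and the names are hypothetical.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

// Illustration of leaking k bits through the "random" value g^r.
class RandomnessChannel {
    static final BigInteger P = BigInteger.ONE.shiftLeft(127).subtract(BigInteger.ONE); // toy prime 2^127 - 1
    static final BigInteger G = BigInteger.valueOf(3);                                   // toy base

    // Draw r repeatedly until the low k bits of g^r equal `secret`
    // (0 <= secret < 2^k). Expected number of draws is about 2^k.
    static BigInteger pickLeakyR(int secret, int k, SecureRandom rng) {
        BigInteger mask = BigInteger.ONE.shiftLeft(k).subtract(BigInteger.ONE);
        BigInteger want = BigInteger.valueOf(secret);
        while (true) {
            BigInteger r = new BigInteger(120, rng);
            if (G.modPow(r, P).and(mask).equals(want)) {
                return r;
            }
        }
    }
}
```

An observer who knows the convention reads the leaked bits straight off the published ciphertext, while each individual r still looks random, which is why the remedies above focus on where the randomness comes from.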


Mega attacks. We believe the Auditorium network of-
fers defense against mishaps and failures of the sort al-
ready known to have occurred in real elections. We fur- 
ther expect the networked architecture to provide some 
defense against more extreme failures and attacks that 
are hypothetical in nature but nonetheless quite serious. 
These “mega attacks,” such as post-facto switched results, 
election-day shadow polling places, and armed booth cap- 
ture (described more fully in previous work [46]), are 
challenges for any electronic voting system (and even 
most older voting technologies as well). 


5 Conclusions and future work 


In this paper we have shown how the VoteBox system
design is a response to threats, real and hypothesized, 
against the trustworthiness of electronic voting. Recog- 
nizing that voters prefer a DRE-style system, we endeav- 
ored to create a software platform for e-voting projects 
and then assembled a complete system using techniques 
and ideas from current research in the field. VoteBox cre-
ates audit logs that are believable in the event of a post-
facto audit, and it does this using the Auditorium network-
ing layer, allowing for convenient administration of polls
as well as redundancy in case of failure. Its code complex- 
ity is kept under control by moving inessential graphics 
code outside the trusted system, with the side effect that 
ballot descriptions can be created—and audited—long be- 
fore election day. Finally, the immediate ballot capture 
technique gives real power to random machine audits. 
Any voter can ask to challenge any voting machine, and 
the machine has no way to know it is under test before it 
commits to the contents of the encrypted ballot. 


VoteBox is a complete system and yet still an ongoing
effort. It is still being actively used for human factors ex- 
perimentation, work which spurs evolution and maturity 
of the software. Many of VoteBox’s features were de-
signed with human factors of both poll workers and vot-
ers in mind. Evaluating these with human subject testing
would make a fascinating study. For example, we could 
evaluate the rate at which voters accidentally challenge 
ballots, or we could ask voters to become challengers and 
see if they can correctly catch a faulty machine. 


We have a number of additional features and improve- 
ments we intend to add or are in the process of adding to 
the system as well. Because one of the chief benefits of 
the DRE is its accessibility potential, we anticipate adding 
support for unusual input devices; similarly, following the 
example of Pvote, we expect that VoteBox’s ballot state
machines will map naturally onto the problem of provid- 
ing a complete audio feedback experience to match the 
video display. As we continue to support human factors 
testing, it is obviously of interest to continue to maintain 
a clear separation and identification of “evil” code; tech- 
niques to statically determine whether this code (or other 
malicious code) is present in VoteBox will increase our
assurance in the system. We are in the process of in- 
tegrating NIZK proofs into our El Gamal encrypted vote 
counters, further bolstering our assurance that VoteBox
systems are behaving correctly. We intend to expand our 
use of Querifier to automatically and conveniently an-
alyze Auditorium logs and confirm that they represent
valid election events. A tabulation system for VoteBox
is another logical addition to the architecture, completing 
the entire election life cycle from ballot design through 
election-day voting (and testing) to post-election auditing 
and vote tabulation. Finally, because VoteBox is a success
story of combining complementary e-voting research
advances, we are on the lookout for other suitable tech-
niques to include in the infrastructure to further enhance
its end-to-end verifiability, in hope of approaching true
software independence in a voter-acceptable way.


Notes


1 http://www.scantegrity.org

2 The Hart InterCivic eSlate voting system also includes a polling
place network and is superficially similar to our design; unfortunately,
the eSlate system has a variety of security flaws [24] and lacks the fault
tolerance, auditability, and end-to-end guarantees provided by VoteBox.

3 While this simple counter-based ballot does not accommodate
write-in votes, homomorphic schemes exist that allow more flexible bal-
lot designs, including write-ins [31].

4 An interesting risk with a data diode is ensuring that it is installed
properly. Polling place systems could attempt to ping known Inter-
net hosts or otherwise map the local network topology, complaining if
two-way connectivity can be established. We could also imagine color-
coding cables and plugs to clarify how they must be connected.

5 Invariably, some percentage of regular voters will accidentally chal-
lenge their ballots. By networking the voting machines together and
raising an alarm for the supervisor, these accidental challenges will only
inconvenience these voters rather than disenfranchising them. Further-
more, accidental challenges helpfully increase the odds of machines
being challenged, making it more difficult for a malicious VoteBox to
know when it might be able to cheat.

6 The VoteBox name derives in part from this early direction, known
at the time as the “BALLOTBOX 360”.

7 http://www.sdl.org

8 Vote centers, used in some states for early voting and others for
election day, will have larger numbers of votes cast than traditional small
precincts. Voting machines could be grouped into subsets that would
have separate Auditorium networks and separate homomorphic tallies.
Similarly, over a multi-day early voting period, each day could be treated
distinctly.


Abstract


It is well known that the use of native methods in Java 
defeats Java’s guarantees of safety and security, which is 
why the default policy of Java applets, for example, does 
not allow loading non-local native code. However, there 
is already a large amount of trusted native C/C++ code 
that comprises a significant portion of the Java Develop- 
ment Kit (JDK). We have carried out an empirical secu- 
rity study on a portion of the native code in Sun’s JDK 
1.6. By applying static analysis tools and manual inspec- 
tion, we have identified in this security-critical code pre- 
viously undiscovered bugs. Based on our study, we de- 
scribe a taxonomy to classify bugs. Our taxonomy pro- 
vides guidance to construction of automated and accurate 
bug-finding tools. We also suggest systematic remedies 
that can mediate the threats posed by the native code. 


1 Introduction 


Since its birth in the mid 90s, Java has grown to be one 
of the most popular computing platforms. Recogniz-
ing Java’s importance, security researchers have scruti-
nized Java’s security from its early days (cf. [8, 28, 32,
25]). Various vulnerabilities in the Java security model 
have been identified and fixed; formal models of vari- 
ous aspects of Java security have been proposed (e.g., 
[41, 13]), sometimes with machine-checked theorems 
and proofs [22]. 

In this paper we examine a less-scrutinized aspect of 
Java security: the native methods used by Java classes. 
It is well known that once a Java application uses native 
C/C++ methods through the Java Native Interface (JNI), 
any security guarantees provided by Java might be in-
validated by the native methods. Figure 1 shows a con-
trived example. The Java class “Vulnerable” contains a
native method, which is realized by a C function. The C 
function is susceptible to a buffer overflow as it performs 
an unbounded string copy to a 512-byte buffer. Conse-
quently, an attacker can craft malicious inputs to the pub-
lic Java byteCopy() method, and take over the JVM.


Java code

    class Vulnerable {
        // Declare a native method
        private native void bcopy(byte[] arr);

        public void byteCopy(byte[] arr) {
            // Call the native method
            bcopy(arr);
        }

        static {
            System.loadLibrary("Vulnerable");
        }
    }


C code

    #include <string.h>
    #include <jni.h>
    #include "Vulnerable.h"

    JNIEXPORT void JNICALL Java_Vulnerable_bcopy
      (JNIEnv *env, jobject obj, jobject arr)
    {
        char buffer[512];
        jbyte *carr;

        carr = (*env)->GetByteArrayElements(env, arr, 0);
        // Unbounded string copy to a local buffer
        strcpy(buffer, carr);
        (*env)->ReleaseByteArrayElements(env, arr, carr, 0);
    }


Figure 1: Vulnerable JNI Code.
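
One ad-hoc way to shield the native method in Figure 1 is to validate the input on the Java side before crossing into C. The sketch below is our own illustration, not a remedy proposed in the study: the class and method names are hypothetical, and the check simply mirrors what strcpy does, which is to copy bytes up to and including the first zero byte.

```java
// Hypothetical Java-side guard for the bcopy native method of Figure 1.
class GuardedCopy {
    static final int NATIVE_BUFFER_SIZE = 512;  // size of `buffer` in the C code

    // True only if strcpy would stop inside the 512-byte buffer, i.e. a zero
    // byte occurs within the first 512 bytes of the array.
    static boolean fitsNativeBuffer(byte[] arr) {
        if (arr == null) {
            return false;
        }
        int limit = Math.min(arr.length, NATIVE_BUFFER_SIZE);
        for (int i = 0; i < limit; i++) {
            if (arr[i] == 0) {
                return true;
            }
        }
        return false;
    }
}
```

A caller such as byteCopy() would reject any array for which `fitsNativeBuffer` returns false before invoking bcopy(). The guard leaves the flaw in the C code in place, and it could be bypassed if another thread modified the array between the check and the native call, so it is at best a stopgap.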


Due to the fundamental insecurity of native C/C++ 
code, the default policy of Java applets, for example, 
does not allow loading non-local native code. Nonethe- 
less, there is already a large amount of trusted native 
code that comprises a significant portion of the Java De- 
velopment Kit (JDK). For instance, the classes under 
java.util.zip in Sun’s JDK are just wrappers that invoke 
the popular Zlib C library. In JDK 1.6, there are over 
800,000 lines of C/C++ code. Over time, the size
of the C/C++ code has grown: JDK 1.4.2 has
500,000 lines; JDK 1.5 has 700,000 lines; and JDK 1.6 
has 800,000 lines. Any vulnerability in this trusted na- 
tive code can compromise the security of the JVM. Sev- 
eral vulnerabilities have already been discovered in this 
code [33, 38, 37]. 

Since the native code in the JDK is critical to Java 
security, examining and ensuring its security is of great 
practical value. As a first step toward this goal, we have 
carried out an empirical security study of this large and 
security-critical code. Our research makes the following 
contributions: 


• This is the first systematic security study of the na-
tive code in Sun’s JDK, a security-critical and ubiq-
uitous piece of software. A few sporadic bug reports
exist, but none have scrutinized this aspect of Java 
security. 


• We discovered previously unknown security-critical
bugs (59 in total). By removing them, the over- 
all Java security will be strengthened. Furthermore, 
we describe a taxonomy of bugs based on our study 
(Section 3). New bug patterns that arise in the con- 
text of the JNI are discussed and analyzed. Our tax- 
onomy provides guidance to construction of scal- 
able and accurate bug-finding tools. 


• We propose remedies (Section 4) to mediate the
threats posed by the native code, with various trade- 
offs among security, performance, and effort. We 
also discuss limitations of current approaches and 
point out future directions. 


2 Overview of the JDK’s native code and 
our approach to characterizing bug pat- 
terns 


The JNI is Java’s mechanism for interfacing with native 
C/C++ code. Programmers use the native modifier to 
declare native methods in Java classes (e.g., the bcopy
method in Figure 1 is declared as a native method). Once
declared, native methods can be invoked in Java in the
same way as ordinary Java methods.
Programmers then provide implementations of the
declared native methods in C or C++. The implementation can
use various API functions provided by the JNI interface 
to cooperate with the Java side. Through the API func- 
tions, native methods can inspect, modify, and create 
Java objects, invoke Java methods, catch and throw Java 
exceptions, and so on. 

In the source¹ directories share/native,
solaris/native, and windows/native of 
Sun’s JDK 1.6 (v6u2), there are over 800,000 lines of
C/C++ code (counted using wc). The native code in
these directories implements the native methods declared 
in the JDK classes. The native code in the directory 
share/native is shared across platforms, while the 
code in solaris/native and windows/native 
is platform dependent. The majority of the native code 
in the JDK is in the C language; around 700,000 lines 
are in C, while the rest are in C++. In our following 
discussion, we will mostly refer to the C code in the 
JDK. All of our discussion, unless otherwise noted,
applies to the C++ code as well. 

The 800k lines of native code can be conceptually di- 
vided into two parts: library code and interface code. 
The library code is the C code that belongs to a com- 
mon C library. For example, the code under share/ 
native/java/util/zip/zlib-1.1.3 is from 
Zlib 1.1.3. The interface code implements Java native 
methods, and glues Java with C libraries through the JNI. 
For example, the C code in native/java/util/ 
zip/Deflater.c implements the native methods in 
the java.util.zip.Deflater class, and glues 
Java with the Zlib C library. 


Our approach to characterizing bug patterns. Given 
the large amount of trusted native code in the JDK, bugs 
are likely to exist. Our ultimate goal is to build highly 
automatic tools that can identify bugs in the JDK’s na- 
tive code. However, as no general methodology exists to 
identify all bugs accurately in a program, we believe that 
the important first step is to collect empirical evidence, 
and characterize relevant bug patterns. Only after this
due diligence can we select the right techniques to take
advantage of the domain knowledge of the JDK and the 
JNI, and construct effective bug-finding tools. 

In the first step, we intend to cover as many bug pat- 
terns as we can. We decided to scan the source code 
using off-the-shelf static analysis tools and also simple 
tools (scripts and scanners) built by us. Although these 
tools are inaccurate, their scanning results are fairly com-
plete and thus enable us to compile enough evidence to
characterize the bug patterns. Next, we
discuss the tools used in our study: 


• To scan for the common bug patterns inside C code,
such as buffer overflows, integer overflows, and race 
conditions, we used a combination of Splint [10], 
Cigital’s ITS4 [39], and Flawfinder [42]. We chose 
a combination of these tools, rather than a single 
one, because their strengths complement one an- 
other. For example, Splint performs full parsing 
and can flag many incompatible type casts. ITS4 
and Flawfinder can flag time-of-check-to-time-of- 
use (TOCTTOU) flaws, among others. 


• Some bug patterns in the JDK’s native code are
particular to the Java Native Interface (JNI) and
we cannot use existing tools to scan for errors in 
these patterns. We have built simple tools, includ- 
ing grep-based scripts and scanners implemented in 
CIL [30], to search for bugs in these patterns. 


• For the list of warnings produced by the static anal-
ysis tools, we manually inspected the source code
to identify true bugs. To help the manual inspec- 
tion, we used the GNU GLOBAL Source Code Tag 
System [16] to build a database of tags in the JDK 
source code, and used htags to generate HTML files 
for the source code. This made source-code navi- 
gation much easier. For example, with one click, 
we can find all places where a particular function is 
invoked. 
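
To give a flavor of what a grep-style scan looks like, the Java sketch below flags source lines that call a few C functions commonly associated with buffer overflows. It is our own minimal illustration, not one of the scripts used in the study, and the function list is a small assumed sample rather than any tool's actual rule set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Minimal line-oriented scanner for a few risky C library calls.
class RiskyCallScanner {
    // Assumed sample of functions; real tools use far larger rule sets.
    private static final Pattern RISKY =
        Pattern.compile("\\b(strcpy|strcat|sprintf|gets)\\s*\\(");

    // Returns the 1-based numbers of lines that contain a call to a listed function.
    static List<Integer> scan(List<String> sourceLines) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < sourceLines.size(); i++) {
            if (RISKY.matcher(sourceLines.get(i)).find()) {
                hits.add(i + 1);
            }
        }
        return hits;
    }
}
```

Such a scan has no notion of comments, data flow, or buffer sizes, so every hit still requires the manual inspection described above; this is the inaccuracy that purely textual tools trade for broad coverage.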


Although the foregoing approach is sufficient for char- 
acterizing bug patterns, it is clear the tools will not be 
scalable to cover all 800,000 lines of native code in the 
JDK. In Section 4, we will discuss techniques that make
significant progress toward providing safety to the JDK’s
native code. 


Target directories. Limited by our time to per- 
form manual inspection, we focused our study on the 
code under the directories share/native/java and 
solaris/native/java. We will call these directo- 
ries the target directories in the following text. The target 
directories include approximately 38,000 lines of C code, 
which implement the native methods in the java.*
classes.


3 Taxonomy of bugs in the JDK’s native 
code 


We now present a collection of bug patterns in the JDK’s 
native code. Some of these patterns are well known, such 
as buffer overflows, but we will discuss them in the
context of the JDK. Some bug patterns are due to the
mismatch between Java’s programming model and C’s, and
are thus unique to this context.

Table 1 shows a summary of the results of our security 
study. For each bug pattern, the table shows the number 
of bugs we identified. We include a bug in the table when 
two conditions hold. First, there must be a programming 
error in the native code. For example, the C code in
Figure 1 contains a programming error: it performs an
unbounded string copy. The second condition is that an
attacker must be able to trigger the programming error. 
For the example in Figure 1, the attacker can trigger the 
error of unbounded string copy by passing malicious data 
to the Vulnerable class.

Table 1 also classifies whether a bug pattern is security
critical. We define a bug pattern as security critical
if, by exploiting bugs in the pattern, an attacker can take
over the JVM, gain unauthorized privileges, or crash the
JVM (a denial-of-service attack). A security-critical bug
is a vulnerability.

Finally, Table 1 shows the static analysis tools we used 
to identify the bugs in a bug pattern, and the section that 
describes detailed findings on the bug pattern. In each 
section, we will show representative examples, but refer 
readers to the appendix of our technical report [35] for a 
full list of the bugs we identified. We will also suggest 
ad-hoc fixes for some bug patterns, but defer discussions 
of more systematic remedies to the next section. 

Not included in the table are the false positive rates of 
the static analysis tools; they will be presented when we 
discuss static analysis as a remedy in the next section. 


3.1 Unexpected control flows due to mishandling exceptions


The JNI interface provides API functions such as Throw 
and ThrowNew for raising Java exceptions. By throw- 
ing an exception, a native method can notify the JVM 
of errors. However, there is a mismatch between Java’s 
exception-handling mechanism and the JNI’s. In Java, 
when an exception occurs, the JVM automatically trans- 
fers the control to the nearest enclosing try/catch state- 
ment that matches the exception type. In contrast, an 
exception raised through the JNI does not immediately 
disrupt the native method execution, and only after the 
native method finishes execution will the JVM mech- 
anism for exceptions start to take over. Therefore, 
JNI programmers must explicitly implement the control 
flow after an exception has occurred, by either imme- 
diately returning to Java or checking and clearing the 
exception explicitly using JNI API functions such as 
ExceptionOccurred and ExceptionClear. 

Because Java and the JNI handle exceptions differ- 
ently, it is easy for JNI programmers to make mistakes. 
Figure 2 presents a contrived example that shows how 
mishandling of exceptions may lead to vulnerabilities. 
At first sight, the strcpy from the incoming Java ar- 
ray to a local buffer is safe: there is a bounds check 
before the copy, and when the check fails, an exception 
is thrown. However, since the exception does not dis- 
rupt the control flow, the strcpy will always be exe- 
cuted and may result in an unbounded string copy. This 
example shows that mishandling exceptions creates un- 
expected control-flow paths where dangerous operations 
might happen. 

The fix for the example in Figure 2 is simple: put a
return statement after the statement that throws the
exception. However, it becomes complicated when function


BUG PATTERNS                                           | ERRORS | SECURITY CRITICAL | STATIC TOOLS USED                | SECTION
Unexpected control flows due to mishandling exceptions | 11     | Y                 | grep-based scripts               | 3.1
C pointers as Java integers                            | 38     | N                 | Our scanner (implemented in CIL) | 3.2
Race conditions in file accesses                       | 3      | Y                 | ITS4, Flawfinder                 | 3.3
Buffer overflows                                       | 5*     | Y                 | ITS4, Splint, Flawfinder         | 3.4
Mem. management flaws: C mem.                          | 1      | N                 | Splint                           | 3.5
Mem. management flaws: Java mem.                       | 28     | N                 | grep-based scripts               | 3.5
Insufficient error checking: JNI APIs                  | 35     | Y                 | grep-based scripts               | 3.6
Insufficient error checking: misc.                     | 5      | Y                 | Splint                           | 3.6
TOTAL                                                  | 126    | 59                |                                  |

*One buffer-overflow flaw is not in the target directory.


Table 1: A summary of the bugs we identified in the target directories.


void Java_Vulnerable_bcopy(JNIEnv *env, jobject obj, jbyteArray jarr) {
  char buffer[512];
  if ((*env)->GetArrayLength(env, jarr) > 512) {
    JNU_ThrowArrayIndexOutOfBoundsException(env, 0);
  }
  //Get a pointer to the Java array, then copy the Java array to a local buffer
  jbyte *carr = (*env)->GetByteArrayElements(env, jarr, NULL);
  strcpy(buffer, carr);
  (*env)->ReleaseByteArrayElements(env, jarr, carr, 0);
}


Figure 2: An example of mishandling JNI exceptions 


calls are involved. Imagine a C function, say f, invokes
another C function, say g, and the function g throws an
exception when an error occurs. The f function has to
explicitly deal with two cases of calling g: the
successful case, and the exceptional case. Mishandling it may
result in the same error as the one in Figure 2. It becomes
much more complicated when the C function f invokes
a Java method. The JVM mechanism for exceptions will
not take effect until the C function returns, even for the
exceptions raised in the Java method.
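
The control-flow obligation described above can be illustrated with a minimal sketch. It does not use the real JNI; MockEnv, mock_throw, copy_checked, and process are hypothetical names, and the pending-exception flag merely stands in for what a JNI throw records. In real JNI code the check in process would be a call such as ExceptionOccurred.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for JNIEnv: it only records a pending exception. */
typedef struct {
    bool exception_pending;
} MockEnv;

/* Like a JNI throw, this records the exception and then returns normally. */
void mock_throw(MockEnv *env) {
    env->exception_pending = true;
}

enum { BUFFER_SIZE = 8 };

/* Plays the role of g: copies len bytes into dst, or throws and returns
 * early so that the copy is never reached with an oversized input. */
void copy_checked(MockEnv *env, char *dst, const char *src, size_t len) {
    if (len > BUFFER_SIZE) {
        mock_throw(env);
        return; /* without this return, the memcpy below would still run */
    }
    memcpy(dst, src, len);
}

/* Plays the role of f: after calling g it checks for a pending exception
 * before using the result. Returns the number of non-NUL bytes copied. */
size_t process(MockEnv *env, const char *src, size_t len) {
    char buffer[BUFFER_SIZE];
    copy_checked(env, buffer, src, len);
    if (env->exception_pending) {
        return 0; /* propagate: return to the caller without touching buffer */
    }
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (buffer[i] != '\0') count++;
    }
    return count;
}
```

Both functions need their own explicit return: the one in copy_checked prevents the dangerous copy, and the one in process prevents the caller from reading a buffer that was never filled.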


We developed a grep-based script to search for all 
places where an exception is explicitly thrown. Of the 
337 hits in the target directories, we found 11 places 
where the control flows for exceptions are implemented 
incorrectly. A representative example from
solaris/native/java/lang/UNIXProcess_md.c is
shown in Figure 3. The macro NEW invokes the
function xmalloc, which in turn invokes malloc
to allocate a specified amount of memory. If
malloc returns null, xmalloc raises an OutOfMemoryError
through JNU_ThrowOutOfMemoryError. However, the
exception does not disrupt the control flow,
and as a result the pathv variable in splitPath gets
null. The subsequent “pathv[count] = NULL”
will crash the JVM.

We classify this bug pattern as being security criti- 
cal because dangerous operations in unexpected control- 
flow paths may enable an attacker to crash or take over 
the JVM. 


3.2 C pointers as Java integers 


Programs that use the JNI often need to pass C point- 
ers through Java. Due to differences between Java’s type 
system and C’s, it is difficult (and sometimes impossi- 
ble) for Java to assign types to C pointer values. The 
commonly used pattern in JNI programming is to cast C 
pointers to Java integers, and pass the resulting integers. 




static void* xmalloc(JNIEnv *env, size_t size) {
  void *p = malloc(size);
  if (p == NULL) JNU_ThrowOutOfMemoryError(env, NULL);
  return p;
}

#define NEW(type, n) ((type *) xmalloc(env, (n) * sizeof(type)))

static const char * const * splitPath(JNIEnv *env, const char *path) {
  ...
  pathv = NEW(char*, count+1);
  pathv[count] = NULL;
  ...
}


Figure 3: An excerpt from solaris/native/java/lang/UNIXProcess_md.c.
Even when xmalloc returns NULL, “pathv[count] = NULL” will be executed.


The pattern is used, for example, in the class
java.util.zip.Deflater. The Deflater class
supports compression using the Zlib C library. The Zlib
library maintains a C structure (z_stream) for storing
the state information of a compression data stream.
A Deflater object holds a pointer to the z_stream
structure, so that when the object calls Zlib a second
time, the state information can be recovered through the
pointer. As it is impossible for Java to declare the pointer
as having the C type “z_stream *”, the C code casts it
into an integer before passing it to Java:


typedef struct z_stream_s {...} z_stream;

jlong Java_java_util_zip_Deflater_init
  (...) {
  z_stream *strm =
    calloc(1, sizeof(z_stream));
  // initialize strm
  return (jlong) strm; //cast it to an integer
}


Whenever Java needs to access the compression
stream, it passes the integer to C. C code then casts the
integer back to a z_stream pointer, through which the
state information of the stream can be retrieved or
updated.

From Java’s perspective, integers that represent C
pointers are just ordinary Java integers. The pattern of
treating C pointers as Java integers is unsafe if an
attacker can inject into the C side arbitrary integer values
that will be interpreted as pointers. Greenfieldboyce and
Foster [17] examined the Gimp Toolkit (GTK) and
discovered seven places where the injection of arbitrary
integers is possible. For example, the native method setFocus
in the GTK (shown below) has an integer parameter that
represents a window pointer. Since the method is
declared as a public method, an attacker can invoke it with
an arbitrary integer value, which may corrupt memory
and result in JVM crashes.


class GUILib { 
public native static void 
setFocus (int windowPtr) ; 


} 


We have built a custom scanner that searches for dan- 
gerous type casts from integers to pointers. The scanner 
is implemented in the CIL framework [30] as a CIL fea- 
ture. We found 38 native methods that accept Java inte- 
gers as arguments and then cast the integers to pointers. 
Compared to the GTK, the JDK’s protection of these in- 
tegers is safer. First, the native methods are all declared 
as private methods. An attacker cannot invoke them arbi- 
trarily. Second, the Java integers that represent C point- 
ers are stored in private fields. 


If we assume Java’s access control rules on private 
fields and methods are strictly enforced, then the JDK’s 
protection on the integers is sufficient. However, with 
the Java reflection API, a Java program can at runtime 
change the private fields that store the C pointers, or in- 
voke private methods. 


If an attacker can use the Java reflection API, then he 
can read and write arbitrary memory locations by ex- 
ploiting the pattern of C pointers as Java integers. For ex- 
ample, the getAdler native method (shown below) in 
the java.util.zip.Deflater class accepts a Java 
long, casts it to a pointer to the z_stream struct, and
returns the adler field in the struct. If an attacker invokes
it with the number that equals a target memory address 
minus the offset of the adler field, then he can read the 
value at the target address.


jint Java_java_util_zip_Deflater_getAdler
  (..., jlong strm) {
  return ((z_stream *)strm)->adler;
}


In a similar vein, the attacker can write to any memory 
location with his data through the setDictionary 
method in the Deflater class; the setDictionary 
method updates a z_stream structure with user- 
supplied data. 

Although the default security policy for running
untrusted Java code does not allow Java reflection, we
believe that passing C pointers as Java integers is
dangerous, for the following reason. For a program in pure
Java, an attacker can use reflection to violate the
program’s access-control policy (e.g., reading private
fields), but the program remains type safe, which implies
that arbitrary memory locations cannot be read or written.
However, when reflection is combined with passing C
pointers as Java integers through the JNI, an attacker can
violate type safety by reading and writing arbitrary memory
locations (as the previous examples show). We believe
that this escalation, from using Java reflection to
reading and writing arbitrary memory locations, is a
violation of the Java security model.


Proposed fixes. We recommend a fix based on an indi- 
rection table of pointers, similar to the OS file-descriptor 
table. The C side uses the indirection table to store point- 
ers and passes table IDs, not pointers, to Java. When C 
gets the table IDs back from Java, it checks the validity 
of the IDs before carrying out dangerous operations. If 
bogus IDs were passed to C, the validity-checking step 
would catch it. 
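
A minimal sketch of such an indirection table follows. It is an illustration under stated assumptions, not code from the JDK: the names handle_register, handle_lookup, and handle_release are hypothetical, the table has a fixed size, and no locking is shown, so a real implementation would need synchronization and a policy for reusing IDs.

```c
#include <stddef.h>
#include <stdint.h>

enum { TABLE_SIZE = 64 };

/* Slot i holds the pointer for ID i + 1; NULL marks a free slot. */
static void *handle_table[TABLE_SIZE];

/* Stores ptr and returns an ID in [1, TABLE_SIZE] to hand to Java.
 * Returns 0 when ptr is NULL or the table is full. */
int64_t handle_register(void *ptr) {
    if (ptr == NULL) return 0;
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (handle_table[i] == NULL) {
            handle_table[i] = ptr;
            return (int64_t)i + 1;
        }
    }
    return 0;
}

/* Validates an ID received from Java. Returns the stored pointer, or
 * NULL when the ID is out of range or does not name a live entry. */
void *handle_lookup(int64_t id) {
    if (id < 1 || id > TABLE_SIZE) return NULL;
    return handle_table[id - 1];
}

/* Frees the slot so that later lookups of this ID fail. */
void handle_release(int64_t id) {
    if (id >= 1 && id <= TABLE_SIZE) handle_table[id - 1] = NULL;
}
```

With this scheme a native method never casts a Java-supplied integer to a pointer; the worst a forged ID can do is name another live entry in the table, not an arbitrary memory address.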


3.3 Race conditions in file accesses


Time-of-check-to-time-of-use (TOCTTOU) bugs refer to
race conditions in which “a program checks for a
particular characteristic of an object, and then takes some
action that assumes the characteristic still holds when in
fact it does not” [3]. Bishop and Dilger [3] identified a
category of TOCTTOU bugs in file accesses. Such flaws
occur, for example, when a program checks the access
privilege of a file through a file path name and then uses
the file through the same file path name. Between the
check and the use, if an attacker can change the file
associated with the file path name, then the program may be
fooled into accessing privileged files that the attacker
could not otherwise access.

We used ITS4 and Flawfinder to scan for file-access 
race conditions in the JDK. We identified three places 
in the target directories where file-access race
conditions might occur. An example in
solaris/native/java/io/UnixFileSystem_md.c is the race
window between stat (line 144) and chmod (line 236). If
the file in question is in a directory writable by the
attacker, then during the race window he can replace it
with a link to any target file. The chmod at line 236 will
then change the protection mode of the target file.

Besides the three race conditions we identified, we
also discovered that the implementation of all the native
methods in the class java.io.UnixFileSystem is
based on path names, instead of file descriptors. For
example, the checkAccess method checks whether the
file or directory denoted by a given path name may be
accessed; the setPermission method turns on or off the
access permission of the file or directory denoted by a
given path name. The class java.io.File, a client of
java.io.UnixFileSystem, uses checkAccess
in methods such as canRead to check access
permissions of a file path name stored in a field of
java.io.File. It also uses setPermission in
methods such as setReadable to set access
permissions of the file path name. As a result, a Java
program that uses java.io.File may have race
conditions if it first invokes canRead and then invokes
setReadable.

File-access race conditions are most relevant in a
multi-user system, which is not a typical environment
for Java. Nevertheless, Java has been and will be
used in a diverse set of scenarios (e.g., Java programs are 
run as root in the Java Authorization Toolkit [1]). Fix- 
ing the TOCTTOU flaws is usually straightforward. For 
example, the race window created by stat followed by 
chmod can be fixed by first opening the file to get its file 
descriptor, and then using fstat and fchmod on the 
file descriptor. 
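
The descriptor-based fix can be sketched as follows, assuming a POSIX system. The function name restrict_writes and the particular permission change are hypothetical and chosen only for illustration; the point is that the check (fstat) and the change (fchmod) both operate on the same open file rather than on a path name that could be rebound in between.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* feature-test macro for strict compilers */
#endif

#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Removes group and other write permission from a regular file.
 * Returns 0 on success and -1 on failure. */
int restrict_writes(const char *path) {
    /* The path name is resolved exactly once, here. */
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    /* Check: inspect the file that is actually open. */
    if (fstat(fd, &st) != 0 || !S_ISREG(st.st_mode)) {
        close(fd);
        return -1;
    }

    /* Use: change the mode of that same file through the descriptor. */
    mode_t mode = (st.st_mode & 07777) & ~(mode_t)(S_IWGRP | S_IWOTH);
    int rc = fchmod(fd, mode);
    close(fd);
    return rc == 0 ? 0 : -1;
}
```

Even if an attacker swaps the directory entry after the open call, fstat and fchmod still refer to the file that was opened, so the race window between check and use is closed.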


3.4 Buffer overflows 


By automatically inserting array bounds checks, Java 
provides built-in protection against buffer overflows. If 
a program is developed in pure Java, we can rest assured
that no buffer will be overflowed. However, since the 
implementation of the JDK contains C/C++ code, it is 
possible for an attacker to pass Java applications unex- 
pected values, which flow to the C code in the JDK and 
trigger a buffer overflow. 

Buffer overflows occur when a C program does not 
perform sufficient bounds checking. Native methods that 
use the JNI often need to check integers from Java for 
negative values. Since Java supports only signed integer 
types, the JNI maps all Java integer types to signed in- 
teger types in C. To use these signed integers safely for 
indices, sizes, and loop counters that should never have 
negative values, explicit checks are necessary. Missing 
checks for negative values may crash the JVM, as past
bug reports have shown [19, 38].

We employed ITS4, Splint, and Flawfinder to scan the 
C files under the target directories for buffer-overflow 
bugs. ITS4 and Flawfinder scan for and report danger- 
ous operations such as strcpy, memcpy, and fscanf. 
Splint reports many type-incompatibility warnings. For 
example, it issues a warning when a signed integer is 
used as an unsigned integer, which is helpful to identify 
missing checks for negative values. With the help of the 
static analysis tools, we discovered seven places where 
there are insufficient bounds checks. Two of them are in 
C functions that are not used by Java, and do not pose a 
security threat to the JVM. The rest pose real threats to 
the JVM: one bug is due to a missing width specifier in 
the format-string argument of fscanf; three bugs are 
due to possible integer overflows that may subsequently 
lead to buffer overflows; one bug is due to insufficient 
bounds checking of a public native method.¹
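
The two kinds of checks discussed in this section, rejecting negative values and guarding size computations against integer overflow, can be sketched in plain C. The helper names range_ok and alloc_size are hypothetical, and int32_t stands in for the JNI type jint.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when the range [off, off + len) lies inside an array of
 * array_len elements. All three values arrive as signed integers,
 * so negative values must be rejected explicitly. */
bool range_ok(int32_t off, int32_t len, int32_t array_len) {
    if (off < 0 || len < 0 || array_len < 0) return false;
    /* Written as a subtraction so that off + len cannot overflow. */
    return off <= array_len - len;
}

/* Computes count * elem_size into *out for use as an allocation size.
 * Returns false when count is negative, elem_size is zero, or the
 * product would not fit in size_t. */
bool alloc_size(int32_t count, size_t elem_size, size_t *out) {
    if (count < 0 || elem_size == 0) return false;
    if ((size_t)count > SIZE_MAX / elem_size) return false;
    *out = (size_t)count * elem_size;
    return true;
}
```

A native method would call such helpers before indexing a buffer or passing a Java-supplied length to malloc, and would throw and return when a check fails.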


3.5 Bugs related to dynamic memory management


The C code in the JDK needs to manage two memory re- 
gions, the C memory region and the Java memory region. 
It may mismanage both memory regions. 


Dynamic memory management in C. Unlike Java, 
the C language provides programmers the power of 
manually managing memory through functions such as 
malloc and free. This power, which seems indispensable
in system programming, has long been a source of
programming defects and, consequently, security
vulnerabilities. Due to manual memory management,
the C code in the JDK may suffer from a range
of flaws, including dereferencing dangling pointers, mul- 
tiple frees, and memory leaks. These defects may make 
the JVM unstable and vulnerable. 

We employed Splint to identify defects related to 
memory management in the target directories. We
manually inspected a large number of warnings and found
only one memory leak.


Managing Java memory through the JNI. Through 
the JNI, native methods can manage the Java mem- 
ory. Certain JNI APIs manage Java memory in a style 
similar to malloc and free. For instance, to access
a Java integer array, a native method first invokes
GetIntArrayElements to obtain a pointer to the
integer array. When the method finishes with the array, it
is supposed to invoke ReleaseIntArrayElements
to release the pointer. These JNI API functions enable
the C method to communicate with Java’s Garbage
Collector (GC). GetIntArrayElements informs the
GC of the creation of a C pointer to the Java array;
the GC should not garbage collect or move the array.
ReleaseIntArrayElements informs the GC that
the C pointer is no longer needed.

This style of manual memory management is error
prone and has problems similar to those of
malloc/free. For example, using the C pointer after
ReleaseIntArrayElements is similar to using a
dangling pointer, since the Java GC may have already
moved or garbage collected the array. Failure to invoke
ReleaseIntArrayElements will make the GC
retain the array indefinitely. Other pairs of functions that
are similar to Get/ReleaseIntArrayElements
include Get/ReleaseStringUTFChars,
New/DeleteGlobalRef, and Push/PopLocalFrame.

We developed grep-based scripts to pattern match 
places where relevant JNI API functions such as 
ReleaseIntArrayElements are used. In the 
target directories, we discovered one place where 
ReleaseStringUTFChars is not invoked (in one 
control-flow path) to release a Java String reference. 
There are also 27 places where JNI global references are
not released.² Although these bugs are not security
critical, they result in memory leaks and are worth fixing.
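
One common C idiom for making sure a release call runs on every control-flow path is a single cleanup label. The sketch below does not use the JNI: acquire_chars and release_chars are hypothetical stand-ins for a Get/Release pair such as Get/ReleaseStringUTFChars, and the outstanding counter exists only to make leaks observable.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Number of acquired buffers that have not been released yet. */
int outstanding = 0;

/* Stand-in for a JNI Get call: returns a copy of s, or NULL on failure. */
char *acquire_chars(const char *s) {
    size_t n = strlen(s) + 1;
    char *p = malloc(n);
    if (p != NULL) {
        memcpy(p, s, n);
        outstanding++;
    }
    return p;
}

/* Stand-in for the matching JNI Release call. */
void release_chars(char *p) {
    if (p != NULL) {
        free(p);
        outstanding--;
    }
}

/* Returns the length of name, or -1 for an empty name or on failure.
 * Every path after a successful acquire passes through cleanup. */
int name_length(const char *name) {
    int result = -1;
    char *chars = acquire_chars(name);
    if (chars == NULL) return -1; /* nothing acquired, nothing to release */

    if (chars[0] == '\0') goto cleanup; /* the error path still releases */
    result = (int)strlen(chars);

cleanup:
    release_chars(chars);
    return result;
}
```

The early return before the acquire succeeds is safe; any exit added later must go through the cleanup label, which is the path that the one missed ReleaseStringUTFChars call described above failed to take.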


3.6 Insufficient error checking 


One of the most common mistakes when writing C code 
is missing checks for error cases. Since the C language 
does not have an exception mechanism, programmers are 
required to perform explicit checks after many function 
calls that may return special values for reporting errors. 
For instance, the standard malloc function returns a
null value if the required space cannot be allocated.
Correct use of malloc requires checking that the return
value is non-null before using it. We encountered two
places where the C code in the target directories omits
this check.

In addition, many JNI API functions use null values
to report errors. For example, the GetFieldID function
returns null when the operation fails.³ The following
code crashes Sun’s JVM when fid is null.


//Get the field ID
fid=(*env)->
  GetFieldID(env, cls, "i", "I");
//Get the int field
int i=(*env)->GetIntField(env, obj, fid);


The above code should first check that fid is non-null
before invoking GetIntField.

We developed grep-based scripts to scan for JNI API 
functions whose return values should be checked. We 
inspected suspicious JNI API calls to check whether their 
return values are checked before use. In total, we found


JNI API FUNCTIONS             | # OF VIOLATIONS
GetFieldID/GetStaticFieldID   | 5
GetMethodID/GetStaticMethodID | 3
GetStringUTFChars             | 4
FindClass                     | 11
New<Type>Array                | 1
NewGlobalRef                  | 11
Total                         | 35


Table 2: Insufficient error checking. For each JNI API, 
the table lists the number of cases in the target directories 
where there is no checking of the return value of the API 
before using the value. 


35 violations. Table 2 summarizes the results in the target 
directories. Note that the table does not include those JNI 
API functions for which we did not find violations. We 
consider insufficient error checking to be security critical 
because it may result in JVM crashes.


3.7 Other flaws resulting from misusing 
the JNI 


For completeness, we next mention other bug patterns in 
the native code of the JDK. For these patterns, we either 
have not found any bugs in the target directories, or have 
not successfully applied static analysis tools. 


Type misuses. The JNI maps Java types to C/C++ 
types and performs necessary conversions when data go 
through the interface. Java primitive types and Java ref- 
erence types are mapped differently. Java primitive types 
are mapped directly. For example, the Java type int
is mapped to the native type jint (declared as a 32-bit
integer in jni.h). Java objects of reference types are
mapped to opaque references, which are pointers to in- 
ternal data structures in the JVM. The exact layout of the 
internal data structures is hidden from programmers. In 
C, all opaque references have the type jobject. Na- 
tive C/C++ code manipulates these references through 
JNI API functions. 

Since native C code treats all references to Java objects
as having one single type⁴, C compilers cannot
distinguish references to objects of different Java classes. As a
result, an object of Java class A may be wrongly passed to
a JNI API function that actually requires an object of Java
class B. Type checking in C compilers cannot catch this
kind of mistake, which usually results in JVM crashes.
More serious is the case that a native method invokes a 
Java method with objects of wrong classes. A type con- 
fusion like this could have serious consequences, as past 
research on Java security has shown [32, 6].

Another case of type misuses in the JNI is that
programmers may invoke wrong JNI API functions. For
example, programmers may use wrong JNI array APIs, as
the JNI provides different APIs for accessing arrays of
different types. There are GetByteArrayElements,
GetIntArrayElements, and others. Calling wrong
JNI API functions may result in improper memory
accesses or JVM crashes.

JSaffire [15] by Furr and Foster is a tool that can check 
type misuses in the JNI code. We did not incorporate 
JSaffire into our step of characterizing bug patterns for 
two reasons. First, this category of bugs has been well 
characterized in previous work [15, 34]. Second, we sus- 
pect type-misuse bugs in the JDK’s native code would be 
rare. Type-misuse bugs usually result in immediate pro- 
gram crashes and are easy to trigger with a small amount 
of test code. As the JDK has been extensively “tested” 
by its users, we believe that most of the type-misuse bugs 
have been fixed. This is partly confirmed by our experi- 
ment. We constructed scripts to search for the most com- 
mon cases of type-misuse bugs, such as passing wrong 
classes to JNI API functions and confusing jclass with 
jobject [24, ch 10.3]; we did not find any such kinds of 
bugs in the target directories. 


Deadlocks. The JNI includes the function pairs
Get/ReleaseStringCritical and
Get/ReleasePrimitiveArrayCritical, which
introduce a critical region. Inside the region, the C code
cannot issue blocking calls or allocate new Java objects.
Otherwise, the JVM may deadlock. We inspected all
such critical regions in the target directories and did not
find any risk of deadlock.


Violating the Java security model. The JNI does not 
enforce access controls on classes, fields, and methods 
that are expressed in the Java language through the use of 
modifiers such as private. Therefore, a native method 
can read private fields of any Java object. Furthermore, 
a native method can violate the Java sandbox security 
model, by performing dangerous operations that would 
otherwise be blocked by the JVM. We have not checked 
the JDK’s native code for these kinds of violations. 


4 Remedies, limitations, and directions 


The native code inside the JDK is critical to Java secu- 
rity. As we and others have demonstrated, after more 
than a decade, there are still flaws remaining in this crit- 
ical code. Once identified, these flaws are generally not 
hard to fix. However, the perpetual mode of patching is 
less than satisfying. Next we discuss more systematic 
approaches, their limitations, and future directions.


4.1 Static analysis


Static analysis is useful for isolating and eliminating se- 
curity bugs, as demonstrated by the number of bugs we 
identified with the help of static analysis tools. On the 
other hand, there are a few limitations of the current gen- 
eration of static analysis tools that prevent us from using 
them to cover all 800k lines of native code in the JDK. 


Limitations of static analysis tools. The tools we used 
issued a large number of warnings that are false positives. 
For each of the three off-the-shelf tools, the following 
table lists the number of warnings it issued, the number 
of true errors, and its false-positive rate. 


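Off-the-shelf tool | Warnings    | True errors | FP rate
ITS4               | [illegible] | 6           | 97%
Flawfinder         | [illegible] | [illegible] | 98.3%
Splint             | [illegible] | [illegible] | [illegible]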


Our own scripts and scanners perform slightly better, 
but the false-positive rates are still high; see Table 3. 
Due to the large number of false positives, we had to 
manually sift through many cases—the principal reason 
why we examined only a portion of the native code in 
the JDK. In addition to false positives, static analysis 
tools may have false negatives. For example, of the four 
buffer-overflow bugs identified in the target directories, 
ITS4 and Flawfinder missed one and Splint missed two. 
Another limitation of the static analysis tools is that 
they analyze C code alone, without considering how the 
Java side interacts with the C side. This is a severe lim- 
itation because the interface code between Java classes 
and C libraries is where most bugs arise. In fact, all 
the bugs we identified are in the interface code. This 
is not only because the two libraries in the target direc- 
tories (namely, Zlib and fdlibm) have been used in many 
other applications besides the JDK and are mature, but 
because programmers tend to make wrong assumptions
about the Java and C sides when writing interface code.
When analyzing the interface code, considering both 
sides of Java and C can significantly increase analysis 
precision and reduce false positives and negatives. To 
illustrate, we use the java.util.zip.Deflater 
class as an example. The public deflate method 
shown in Figure 4 accepts a buffer, an offset, and a 
length from users, and then invokes the native method 
deflateBytes. To be safe, the deflate method 
checks the bounds of the offset and the length parameters 
before invoking the native method deflateBytes. 
For the example in Figure 4, a static analysis that 
analyzes only C code has to make either an optimistic 
or a pessimistic assumption about whether the Java
side has performed the bounds checking. If the analysis
makes the optimistic assumption, it would produce
false negatives if the Java side had forgotten to
check the bounds. If it makes the pessimistic assump- 
tion, it would have to flag any access to the b buffer 
through the offset and the length as a possible error. 
For example, the SetByteArrayRegion operation in
deflateBytes would be flagged as a possible out-of- 
bounds array write, even though that is impossible given 
the Java context. Bug finders usually make pessimistic 
assumptions for the purpose of minimizing false
negatives. For instance, Splint flags “malloc(len)” in
deflateBytes and complains about an incompatible
type cast from the signed integer len to the unsigned
integer expected by malloc; it does not know that the
Java side invokes deflateBytes only with non-negative
lengths.

The necessity of inter-language analysis is also strongly
reinforced by our experience of manual inspection. For
many warnings, we inspected both their C and Java con- 
texts to decide if they are true errors. To give a rough idea 
of how many warnings cannot be eliminated as false pos- 
itives without taking the Java context into account, we 
examined the 139 incompatible-type-cast warnings that 
Splint issued for the C code under java.util.zip 
and found that in 22 cases the Java context must be in- 
spected. 


Future directions of improving static analysis tools. 
Some of the limitations we mentioned are particular to 
the tools we used, and are not fundamental to static anal- 
ysis. The off-the-shelf tools used in this study are known 
for having high false-positive rates. ITS4, Flawfinder, 
and our own tools are based on simple syntactic pattern
matching; Splint performs certain type-based analyses,
but is still a coarse-grained tool. We believe
false-positive rates can be significantly reduced through
advanced static techniques such as software model
checking (e.g., MOPS [4], CMC [29], SLAM [2], and
Bandera [7]), type qualifiers [12, 17], theorem-proving
techniques (e.g., ESC/Java [11]), and others.

To better analyze interface code, we advocate inter- 
language analysis across Java and C. Most existing tools 
are limited a priori to code written in a single lan- 
guage. Few inter-language analyses across Java and C 
exist. JSaffire [15] is an exception, but can only check 
for type misuses of data from Java to C. Our previous 
work, ILEA [36], enables general inter-language analy- 
sis across Java and C. The basic approach of ILEA is to 
perform a partial compilation from C code to a specifi- 
cation based on Java so that an existing Java analysis can 
understand the behavior of the C code through the Java 
specification. ILEA extends Java with a set of simple, yet 
powerful approximation primitives, which enable auto- 


BUG PATTERNS                                           | TOOL                 | WARNINGS | TRUE ERRORS | FPR
Unexpected control flows due to mishandling exceptions | grep-based scripts   | 337      | 11          | 96.7%
C pointers as Java integers                            | scanner built in CIL | 46       | 38          | 17.4%
Mem. management flaws (Java Mem.)                      | grep-based scripts   | 43       | 28          | 34.9%
Insufficient error checking (JNI APIs)                 | grep-based scripts   | 230      | 35          | 84.8%


Table 3: False-positive rates of our tools. 


java.util.zip.Deflater

    public class Deflater {
        public synchronized int deflate(byte[] b, int off, int len) {
            if (off < 0 || len < 0 || off > b.length - len) {
                throw new ArrayIndexOutOfBoundsException();
            }
            return deflateBytes(b, off, len);
        }
        private native int deflateBytes(byte[] b, int off, int len);
    }

C implementation of deflateBytes()

    jint Java_java_util_zip_Deflater_deflateBytes
        (JNIEnv *env, jobject this, jarray b, jint off, jint len) {
        ...
        out_buf = (jbyte *) malloc(len);
        ...
        (*env)->SetByteArrayRegion(env, b, off, len - strm->avail_out, out_buf);
        ...
    }


Figure 4: An example illustrating the necessity of inter-language analysis


matic extraction of partial Java specifications of C code. 
Through ILEA, any existing analysis on Java in principle 
can be extended to also cover C code. In practice, how- 
ever, ILEA is restricted by its compilation precision, and 
also by the effectiveness of the Java analysis. 


We plan to combine advanced static analysis tech- 
niques with the ideas in ILEA to build high-precision, 
inter-language tools that hunt for bugs in the JDK’s na- 
tive code. We are particularly interested in taint analysis 
and software model checking. Static taint analysis (e.g., 
[26]) can track attacker-controllable data that flow from 
Java to C. Software model checking can check for viola- 
tions of many patterns we have discussed as they can be 
formalized as state machines. We plan to investigate C 
model checkers such as MOPS [4] and CMC [29] and ex- 
tend them to perform inter-language checking using the 
ideas in ILEA. 


Finally, we believe it is important to formalize the
soundness proofs of static analysis tools. Formal study
helps understand the assumptions, clarify guarantees, 
and reduce false negatives. In the context of the JNI, for- 
mal study is complicated by the lack of formal semantics 
of the C language. It is perhaps helpful to focus instead 
on a well-defined subset of C such as Cminor [23]. 


4.2 Dynamic Mechanisms 


Static analysis analyzes programs to find implementation 
errors before the programs are run. An alternative is to 
use dynamic mechanisms to prevent or isolate errors dur- 
ing runtime. Dynamic mechanisms can take advantage 
of richer runtime information to check certain properties 
easily, although sacrificing some performance. 

Our previous work, SafeJNI [34], is a mostly dynamic 
mechanism for ensuring the safety of JNI-based programs
such as the JDK. It first leverages CCured [31]
to provide internal memory safety to the C code. CCured
analyzes C programs to identify places where memory 
safety might be violated and then inserts runtime checks 
to ensure safety. SafeJNI also inserts runtime checks at 
the boundary between Java and C to make sure that the C 
code accesses the Java state safely and cooperates with 
Java’s garbage collector. SafeJNI incurs a performance
overhead of 14-119% on a set of microbenchmark programs,
and 63% on Zlib.

Table 4 summarizes how SafeJNI protects Java from 
bugs in the native code in terms of the various bug 
patterns discussed before. SafeJNI protects Java from 
most kinds of bugs in the native code. Its main limitation
is that it does not protect against concurrency-related
bugs (race conditions and deadlocks); we believe such
bugs are best addressed through advanced static analysis
techniques.


Future directions. We believe that SafeJNI is a 
promising direction to prevent errors in the native code. 
We plan to reduce its overhead in two ways. First, static 
analysis techniques can eliminate a large number of dynamic
checks. For example, many runtime type checks
can be eliminated if we can statically track the classes of
Java objects in C, similar to what JSaffire does [14].

Second, we plan to explore other more efficient ways 
of providing internal safety to C code than CCured. Our 
experiment showed that CCured accounted for most of 
the performance overhead in SafeJNI (46% out of 63% 
in Zlib). The relatively large performance slowdown is 
because CCured guarantees every C buffer is well pro- 
tected. For instance, given the code below 


int *p = (int *) malloc(1024 * sizeof(int));
*(p+i) = 3;


CCured in general will insert the runtime check “0 <= i
< 1024” before “*(p+i) = 3”.
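
The effect of such an inserted check can be pictured with a small hand-written sketch. The helper names checked_store and demo_store and the error convention are assumptions made for illustration only; CCured generates its checks automatically and tracks bounds through its pointer kinds rather than through an explicit length argument.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not CCured's actual output): store v at p[i]
 * only if i lies inside the n-element buffer. Returns 0 on success
 * and -1 if the access would fall outside the buffer. */
int checked_store(int *p, size_t n, long i, int v) {
    if (p == NULL || i < 0 || (size_t)i >= n) {
        return -1;              /* the runtime check "0 <= i < n" failed */
    }
    *(p + i) = v;
    return 0;
}

/* Allocates room for 1024 ints and attempts the write *(p+i) = 3
 * through the check. Returns the result of the checked store. */
int demo_store(long i) {
    int *p = (int *) malloc(1024 * sizeof(int));
    int rc = checked_store(p, 1024, i, 3);
    free(p);
    return rc;
}
```

Every store pays for the comparison, which is the kind of per-access cost that the overhead figures above reflect.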

If the safety policy is to protect the JVM state from 
being accidentally destroyed by C code, then Software 
Fault Isolation (SFI [40, 27]) of the C code is sufficient. 
Whenever the JVM starts to execute a native method, it 
can first allocate a chunk of memory, say 16MB, and hand
the memory region to the native method. An SFI-based
scheme can then guarantee that no access by the C code
escapes the memory region, and thus that the C code cannot
destroy the JVM state.
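
One common way to confine accesses to such a region is address masking. The sketch below assumes a 16MB region whose base address is aligned to its size; the names REGION_SIZE and sfi_confine are hypothetical, and a real SFI system would apply the masking by rewriting the native code's memory instructions rather than by calling a C function.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed region size: 16 MB, a power of two. */
#define REGION_SIZE ((uintptr_t)1 << 24)

/* Force addr into [base, base + REGION_SIZE) by keeping only its low
 * 24 bits and replacing the high bits with those of base.
 * Assumes base is aligned to REGION_SIZE. */
uintptr_t sfi_confine(uintptr_t base, uintptr_t addr) {
    assert((base & (REGION_SIZE - 1)) == 0);   /* base must be region-aligned */
    return base | (addr & (REGION_SIZE - 1));
}
```

An address already inside the region is unchanged, while an address outside it is silently redirected into the region. The error is contained, not detected, which matches the isolation-only guarantee described here.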

Schemes based on SFI can isolate errors within native
components, but do not prevent exploits of vulnerabilities
inside the components. XFI [9], on the other hand,
can prevent exploits of a large number of vulnerabilities 
by enforcing properties such as control-flow integrity. In 
addition, it works on assembly code and is not restricted 
to a source programming language.


4.3 Reimplementation in safer languages


It can be argued that the C language is intrinsically un- 
safe and should not be used in the JDK. In the long run, 
we believe the C code in the JDK should be reimple- 
mented in safer languages. The obvious choice is Java. 
This is a feasible approach, as there exist implementa- 
tions in pure Java of many programs originally written 
in C, such as the Zlib library [20]. GNU Classpath, an 
open-source replacement of Sun’s JDK, takes this ap- 
proach seriously; one of their long-term goals is to be- 
come JNI independent by implementing everything in 
Java [5]. On the flip side, rewriting the existing 800 kloc 
of C/C++ code in Java will require a substantial invest- 
ment, and will likely have a negative impact on execution 
speed. 

Another idea is to use a safe C variant to port the 
C code. Cyclone [21] is a reasonable choice. Since 
the syntax and semantics of Cyclone are close to C, 
porting C code to Cyclone should take less time than, 
say, a complete rewrite in Java. However, as Cyclone 
has a strong type system and uses region-based mem- 
ory management, converting to type-checkable Cyclone 
code will not be a trivial effort. Furthermore, this ap- 
proach alone can guarantee only the internal safety of C 
code. The C code can still misuse the JNI interface. 

Since the JNI interface is extraordinarily verbose and 
error prone, one approach to reducing flaws is to use a 
better interface between Java and C. A notable exam- 
ple is Jeannie [18], which allows programmers to write 
mixed Java and C code in a single file. The Jeannie com- 
piler then translates mixed Java/C code into code that 
uses the JNI. Although in Jeannie it is still possible to 
write unsafe code, Jeannie helps programmers reduce er- 
rors. For example, in Jeannie programmers can raise Java 
exceptions directly, thus avoiding the control-flow prob- 
lem when raising JNI exceptions (Section 3.1). 


5 Conclusion 


The large amount of native code in the JDK is a time 
bomb in Java security. Our study has examined a range of 
bug patterns in the JDK’s native code, from well-known 
buffer overflows to new patterns such as unexpected con- 
trol flow paths due to mishandling JNI exceptions. Given 
the importance of Java, it is imperative to develop better, 
inter-language static and dynamic mechanisms to medi- 
ate the threats posed by the native code. 

Through our study, we hope to send the message that 
the native code should be kept at a minimum in the JDK. 
On the contrary, the native code in Sun’s JDK has been 
on the increase. The native code is outside of the Java 
security model and defeats Java’s main goals: safety, se- 
curity, and platform independence. In the long run, most 




BUG PATTERNS | HOW SAFEJNI WORKS AGAINST THE BUGS
Unexpected control flows due to mishandling exceptions | Through SafeJNI’s dynamic checks on pending exceptions.
Race conditions in file accesses | N/A
Buffer overflows | Through CCured and SafeJNI’s static pointer kind system.
Mem. management flaws (C mem.) | Through CCured.
Mem. management flaws (Java mem.) | Through SafeJNI’s memory management scheme.
Insufficient error checking (JNI APIs) | Through SafeJNI’s dynamic checks.
Insufficient error checking (misc.) | Through CCured.
Type misuses | Through SafeJNI’s dynamic checks.
Deadlocks | N/A
Violating the Java security model | Partly addressed through SafeJNI’s dynamic checks on access-control rules on Java fields/methods.


Table 4: How SafeJNI protects the JVM from bugs.


of the native code should be ported to safer languages 
such as Java.


Notes

1 This bug is not in the target directory and was found in a casual
inspection.

2 Global references are never released in the code we examined, although
the JNI manual explicitly mentions the necessity of freeing
global references [24, ch5.2.3].

3 It fails if the specified field cannot be found, or if the class initializer
fails, or if the system runs out of memory [24].

4 In C++, certain Java built-in classes have corresponding C++
classes in the JNI (predefined in jni.h). References to objects of other
Java classes, including all user-defined classes, are still mapped to
jobject.

5 With the options “+posixlib -paramuse -redef -noeffect -varuse
-exportlocal -incondefs -booltype jboolean -booltrue JNI_TRUE
-boolfalse JNI_FALSE -predboolint -compdef”.




AutoISES: Automatically Inferring Security Specifications and Detecting
Violations


Lin Tan†
University of Illinois, Urbana-Champaign
lintan2@cs.uiuc.edu


Xiaolan Zhang
IBM T.J. Watson Research Center
cxzhang@us.ibm.com


Xiao Ma†‡, Weiwei Xiong†, Yuanyuan Zhou†‡
†University of Illinois, Urbana-Champaign
‡Pattern Insight Inc.
{xiaoma2, wxiong2, yyzhou}@cs.uiuc.edu


Abstract 


The importance of software security cannot be over- 
stated. In the past, researchers have applied program 
analysis techniques to automatically detect security vul- 
nerabilities and verify security properties. However, such 
techniques have limited success in reality because they 
require manually provided code-level security specifications.
Manually writing these code-level security specifications
is tedious and error-prone. Additionally, such specifications
seldom exist in production software.

In this paper, we propose a novel method and tool,
called AutoISES, which Automatically Infers Security
Specifications by statically analyzing source code, and
then directly uses these specifications to automatically detect
security violations. Our experiments with the Linux
kernel and Xen demonstrated the effectiveness of this approach:
AutoISES automatically generated 84 security
specifications and detected 8 vulnerabilities in the Linux
kernel and Xen, 7 of which have already been confirmed
by the corresponding developers.


1 Introduction 


1.1 Motivation 


The critical importance of software security has driven 
the design and implementation of secure software sys- 
tems. Security-Enhanced Linux (SELinux) [23, 28], de- 
veloped as a research prototype to incorporate Manda- 
tory Access Control (MAC) into the Linux kernel sev- 
eral years ago, imposes constraints on its existing Dis- 
cretionary Access Control (DAC) for stronger security. 
SELinux has since been adopted by the mainline Linux 
2.6 series and incorporated into many commercial distributions,
including Red Hat, Fedora, and Ubuntu. Recently,
Xen also adopted a similar MAC security architecture
to enable system-wide security policy [7].

A core part of such access control systems is
a set of security check functions, which check
whether a subject (e.g., a process) can perform a
certain operation (e.g., read or write) on an object
(e.g., a file, an inode, or a socket). These protected
operations are called security sensitive operations.
For example, Linux’s security check function
security_file_permission((file), ...) can determine
if the current process is authorized to read or
write the file, while another security check function,
security_file_mmap((file), ...), checks if the current
process is authorized to map a file into memory.
To ensure only authorized users can read or write
the file, developers must add the security check function
security_file_permission() before each file
read/write operation on every file. Similarly, developers
must add security_file_mmap() each time before
mapping a file to memory, to ensure only authorized
users can memory map the file.
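
The check-before-operation discipline can be sketched with a toy model. Everything below (struct toy_file, toy_file_permission, toy_read, and the OP_* constants) is hypothetical and much simpler than the kernel's data structures; the only convention borrowed from the text is that a check function is consulted before the sensitive operation is carried out.

```c
#include <stddef.h>
#include <string.h>

enum { OP_READ = 1, OP_WRITE = 2 };

/* Toy object: a file plus the operations the current subject may perform. */
struct toy_file {
    int allowed_ops;
    char data[16];
};

/* Toy security check function: 0 means authorized, -1 means denied. */
int toy_file_permission(const struct toy_file *f, int op) {
    return (f->allowed_ops & op) == op ? 0 : -1;
}

/* Security sensitive operation guarded by the check. Returns the number
 * of bytes copied, or -1 if the subject is not authorized to read. */
long toy_read(const struct toy_file *f, char *buf, size_t n) {
    if (toy_file_permission(f, OP_READ) != 0) {
        return -1;                      /* check failed: do not touch the data */
    }
    if (n > sizeof f->data) {
        n = sizeof f->data;
    }
    memcpy(buf, f->data, n);
    return (long)n;
}
```

A read path that skipped the call to toy_file_permission() would still work functionally, which is why a missing check is easy to overlook.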


A major challenge of supporting the secure architec- 
ture above is to ensure that all sensitive operations on all 
objects are protected (i.e., checked for authorization) by 
the proper security check functions in a consistent man- 
ner. If the proper security check function is missing be- 
fore a sensitive operation, an attacker with insufficient 
privilege will be able to perform the security sensitive op- 
eration, causing damage. For example, the file read/write 
operation is performed in many functions throughout 
the Linux kernel, such as read(), write(), readv(), 
writev(), readdir(), and sendfile(). Despite the 
different names of these function calls, they all per- 
form the same conceptual file read/write operation, and 
must be checked for authorization by calling the security 
check function security_file_permission(). As
the Linux kernel code is reasonably mature, most of these 
functions performing the file read/write operation, such 
as read(), write(), and readdir(), are protected by


(a) file read/write operation protected by the check

    linux/fs/read_write.c:
    ssize_t vfs_read(...) {
        ret = security_file_permission(file, ...);
        ...
        // performs file read/write operation
        ret = file->f_op->read(file, ...);
        ...
    }

(b) file read/write operation protected by the check

    linux/fs/read_write.c:
    ssize_t vfs_write(...) {
        ret = security_file_permission(file, ...);
        ...
        // performs file read/write operation
        ret = file->f_op->write(file, ...);
        ...
    }

(c) file read/write operation protected by the check

    linux/fs/readdir.c:
    ssize_t vfs_readdir(...) {
        ret = security_file_permission(file, ...);
        ...
        // performs file read/write operation
        ret = file->f_op->readdir(file, ...);
        ...
    }

(d) Security Violation - violating the implicit security specification

    linux/fs/read_write.c:
    static ssize_t do_readv_writev(...) { ...
        // Forgot to call security_file_permission()!
        // A security violation.
        // performs file read/write operation
        ret = file->f_op->readv(file, ...); ...
    }


Figure 1: A real security violation in Linux 2.6.11. The security check security_file_permission() was missing before
the security sensitive operation performed via file->f_op->readv(), violating the implicit security specification: every
file read/write operation must be checked for authorization using security_file_permission(). This is a real security
violation, which has already been fixed in later versions. The code is slightly modified to simplify illustration.


the security check function, as shown in Figure 1(a)-(c).
However, in a few other cases, as shown in Figure 1(d),
the security check function is not invoked before the file
read/write operation performed by readv(), violating
the implicit security specification or security rule: every
file read/write operation must be protected by calling the
security check function security_file_permission().
Due to this real-world security vulnerability in Linux
2.6.11 (CVE-2006-1856 [1]), unauthorized users can read
and write files that they are not allowed to access, potentially
providing unauthorized user account access. Additional
damage might include partial confidentiality, integrity,
and availability violations, unauthorized disclosure
of information, and disruption of service.


There have been great advances in applying program 
analysis techniques [2, 4, 5, 12, 16] to automatically de- 
tect these security vulnerabilities and to verify security 
properties [6, 9, 18, 30]. Generally, these tools take a 
specification that describes the security properties to ver- 
ify as input. For example, in earlier efforts [9, 30], the 
authors manually identified the data types (e.g., struct 
file, struct inode, etc.) that might be accessed to 
perform security sensitive operations and automatically 
verified that any access to these data types was protected 
by a security check function. Although these previous
studies detected some vulnerabilities, and made significant
progress toward automatic verification of security
properties, they are limited in two respects:


• All these previous tools require developers or tool
users to provide code-level security specifications,
which greatly limits their practicality in
checking and verifying security properties. Writing
specifications that accurately capture the security
properties of a piece of software while
maintaining their correctness across different
versions of the software is notoriously difficult. Such
specifications seldom exist in production software.

• Human-generated specifications can be imprecise,
causing false positives and potentially false negatives
in violation detection. As an example, the specification
used in one earlier work [30] introduced
false positives because it treated any access
to specified data structures as a security sensitive
operation. In reality, a security sensitive operation
typically consists of accesses to multiple data structures:
a file read/write operation involves accessing
struct file, struct inode, struct dentry,
etc. Accessing the file structure alone is not necessarily
(actually in most cases is not) a file read/write
operation. In addition, some field accesses (e.g.,
file->f_version) of a security sensitive data type
are not part of any security sensitive operation, and
therefore do not need to be protected. Therefore, in
the two cases above, simply requiring accesses to every
field of these data structures to be protected led
to false positives [30]. The specification may also
introduce false negatives because it does not specify
which security check is required for which operation.
The tool can fail to detect violations where
the wrong check function is used, as different security
sensitive operations (e.g., file read/write and memory
map) may access the same data types (e.g., struct
file) but require different security checks.


Therefore, to design tools that are truly usable for ordinary
programmers, it is highly desirable for these tools
to meet the following three requirements on specification
generation: (1) to automatically check source
code for security violations, the security specifications
must be at the code level. The conceptual specification
“file read/write operation must be protected by the security
check function security_file_permission()”
cannot be checked against source code without knowing
its corresponding code-level representation. (2) as it
is tedious and error-prone for developers to write these


(a) Security Rule/Specification

    Security Check Function: security_file_permission
    Security Sensitive Operation (A group of data structure accesses):
     1. READ inode->i_size
     2. READ file->f_flags
     3. READ file->f_pos
     4. READ file->f_dentry
     5. READ file->f_dentry->d_inode
     6. READ file->f_vfsmnt
     7. READ dentry->d_inode
     8. READ address_space->flags
     9. READ address_space->nrpages
    10. WRITE address_space->nrpages
    11. READ address_space->page_tree
    12. READ address_space->tree_lock
    13. READ page->_count
    14. READ page->flags
    15. WRITE page->index
    16. READ page->mapping
    17. WRITE page->mapping
    18. READ pglist_data->node_zonelists
    19. READ zone->wait_table
    20. READ zone->wait_table_bits
    21. READ (Global) nr_pagecache
    22. READ (Global) zone_table

(b) Security Violation

    linux/fs/read_write.c:
    static ssize_t do_readv_writev(...) { ...
        // file->f_op->readv is set to function
        // generic_file_readv().
        ret = file->f_op->readv(); ...
    }

    security_file_permission() is missing before the file read/write
    operation, which accesses data structures in (a) in many different
    functions and files.

(c) Accesses in the Code

    linux/mm/filemap.c:
    ssize_t generic_file_readv(...) { ...
        __generic_file_aio_read(); ...
    }

    linux/mm/filemap.c:
    ssize_t __generic_file_aio_read(...) {
        struct file *filp = iocb->ki_filp; ...
        if (filp->f_flags ...) { ...        <READ file->f_flags>
        }
        i_size_read(); ...
    }

    linux/include/linux/fs.h:
    static inline loff_t i_size_read(struct inode *inode) { ...
        return inode->i_size;               <READ inode->i_size>
    }


Figure 2: A code-level security specification AutoISES automatically generated and a real security violation to the specification
in Linux 2.6.11. Panel (a) shows the security rule, consisting of a security check function and a group of data structure
accesses. Each row is one access, which can be either a structure field access or a global variable access, denoted by “(Global)”.
For each structure field access, the name before the first -> is the type name of the structure, and the rest are field names. For a
global variable, the variable name is used. The code is slightly modified to simplify illustration.


security rules, the tool should automatically generate se- 
curity specifications with minimum user/developer in- 
volvement; and (3) the generated specification must be 
precise, otherwise it would result in too many false posi- 
tives and/or false negatives. 


1.2 Our Contributions 


This paper makes two contributions: 


(1) We propose an approach and a tool, AutoISES, 
to automatically extract concrete code-level security 
specifications from source code and automatically de- 
tect security violations to these specifications. Our 
key observation is that although the same security sen- 
sitive operation can be performed in different functions, 
ultimately, the structure fields and global variables these 
functions access are the same. We refer to these structure
field and global variable accesses collectively as data structure
accesses. For example, all of the different functions
performing the file read/write operation share the
22 data structure accesses listed in Figure 2(a) (automat- 
ically generated by AutoISES), including reading field 
f_flags of the file structure and reading field i_size 
of the inode structure. These 22 data structure accesses 
are performed in many different functions located in dif- 
ferent source files. Intuitively, this makes sense as secu- 
rity check functions are designed to protect data. There- 
fore, the use of data structure accesses is fundamental in 
representing security sensitive operations.

Based on this observation, we propose a method and a
tool, called AutoISES, to Automatically Infer Security
Specifications by statically analyzing source code and
directly use these specifications to automatically detect
security violations. Specifically, if a code-level security
sensitive operation is frequently protected by a security
check function in source code, AutoISES automatically
infers that the security check function should be used to
protect the particular code-level security sensitive operation.
Our rationale is that for released software, the majority
of the code should be correct; therefore we can use
the code to infer security specifications or rules, which
are observed in most places of the source code, but may
be violated in a few others. The rationale is similar to that
of prior work in specification mining [10, 11, 20, 22],
each of which extracts different types of programming
rules automatically from source code or execution traces.
However, previous techniques are not directly applicable
to our problem, because they are limited by the
types of rules they can infer (e.g., function correlation
rules [10, 20], variable value related invariants [11], variable
pairing rules [22]). As noted previously, our key observation
is that a sensitive operation should be represented
as data structure accesses; this requires
AutoISES to learn specifications that contain
both functions and multiple variable accesses that satisfy
certain constraints. None of the previous techniques can
be applied without significant re-design of the learning
algorithm (more detailed discussion in Section 6.1 and
Section 7).
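
The "frequently protected" rationale can be illustrated with a deliberately simplified sketch. It assumes the sites performing one sensitive operation have already been identified, and it uses a hypothetical percentage threshold (min_percent); the names struct site, infer_rule, and find_violations are illustrative only, and the actual inference over groups of data structure accesses is the subject of Section 4.

```c
#include <stddef.h>

/* One place in the code that performs the sensitive operation. */
struct site {
    const char *function;   /* enclosing function name */
    int is_protected;       /* 1 if the check function precedes the operation */
};

/* Infer the rule "this operation must be protected by the check" when at
 * least min_percent of the sites are protected. Returns 1 if inferred. */
int infer_rule(const struct site *sites, size_t n, int min_percent) {
    size_t protected_count = 0;
    for (size_t i = 0; i < n; i++) {
        if (sites[i].is_protected) {
            protected_count++;
        }
    }
    return n > 0 && protected_count * 100 >= (size_t)min_percent * n;
}

/* If the rule is inferred, record the indices of unprotected sites in out
 * (up to cap entries) and return how many were recorded; otherwise 0. */
size_t find_violations(const struct site *sites, size_t n, int min_percent,
                       size_t *out, size_t cap) {
    size_t k = 0;
    if (!infer_rule(sites, n, min_percent)) {
        return 0;
    }
    for (size_t i = 0; i < n; i++) {
        if (!sites[i].is_protected && k < cap) {
            out[k++] = i;
        }
    }
    return k;
}
```

With the four functions of Figure 1 as input and a threshold of 75, the rule is inferred from the three protected sites and the unprotected readv path is reported as the single violation.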

We evaluated AutoISES on the latest versions of
two large software systems, the Linux kernel and
Xen, to demonstrate the effectiveness of our approach.
AutoISES automatically extracted 84 rules from the
Linux kernel and Xen, and detected 8 true violations, 7 of
which are confirmed and fixed by the corresponding developers.
Figure 2 shows (a) the code-level security specification
learned by AutoISES, which consists of the 22
data structure accesses, (b) a security violation automatically
detected by AutoISES, and (c) the unprotected sensitive
operation that performs all the accesses shown in
(a) in different functions located in various source files.
It would be very difficult, if not impossible, for a human
being to generate such a specification. More examples and
results can be found in Section 5.

The automatically generated specifications can also 
be used by other analysis tools for vulnerability detec- 
tion. Additionally they can assist in software understand- 
ing and maintenance. These results demonstrate that 
AutoISES is effective at automatically inferring secu- 
rity rules and detecting violations to these rules, which 
greatly improves the practicality of security property 
checking and verification tools. 


(2) We quantitatively evaluate rule granularity im- 
pact on the accuracy of rule inference and vi- 
olation detection. Security specifications can vary 
in granularity. For example, a single access 
can be represented with the access type (read or 
write), READ inode->i_size, or without it, ACCESS 
inode->i_size. Similarly, the same access can be represented
with the structure field, READ inode->i_size,
or without it, READ inode. Theoretically, finer granu- 
larity causes more false negatives and fewer false pos- 
itives for violation detection compared to coarser gran- 
ularity. The choice of granularity can greatly affect the 
accuracy of rule inference and violation detection. Al- 
most all previous rule generation and violation detection 
techniques [3, 9, 10, 11, 20, 22, 30] choose fixed gran- 
ularity without quantitatively evaluating how good their 
choice is. 
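
The granularity choices can be made concrete with a small sketch that renders one access at each level. The struct and function names are hypothetical; only the textual forms (READ, WRITE, ACCESS, type->field) follow the examples in the text.

```c
#include <stdio.h>
#include <string.h>

/* One data structure access: a read or write of type_name->field. */
struct access {
    int is_write;
    const char *type_name;
    const char *field;
};

/* Render an access at the chosen granularity. keep_kind = 0 replaces
 * READ/WRITE with ACCESS; keep_field = 0 drops the field name. */
void format_access(char *out, size_t cap, const struct access *a,
                   int keep_kind, int keep_field) {
    const char *kind = keep_kind ? (a->is_write ? "WRITE" : "READ") : "ACCESS";
    if (keep_field) {
        snprintf(out, cap, "%s %s->%s", kind, a->type_name, a->field);
    } else {
        snprintf(out, cap, "%s %s", kind, a->type_name);
    }
}

/* Two accesses are indistinguishable at a granularity if they render
 * to the same string at that granularity. */
int same_at_granularity(const struct access *a, const struct access *b,
                        int keep_kind, int keep_field) {
    char x[128], y[128];
    format_access(x, sizeof x, a, keep_kind, keep_field);
    format_access(y, sizeof y, b, keep_kind, keep_field);
    return strcmp(x, y) == 0;
}
```

At the finest granularity, READ address_space->nrpages and WRITE address_space->nrpages are distinct accesses; dropping the access type merges them, and dropping the field merges every read of the same structure type.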

In our work, we quantitatively evaluate the impact of 
different rule granularity on rule inference and violation 
detection. This approach is orthogonal to our automatic 
rule inference techniques and can be applied to other rule 
extraction techniques. 

Interestingly, our results show that if we do not distinguish
the fields, then the inferred code-level security sensitive
operation for the check functions security_inode_link() and
security_inode_unlink() is the same, failing to distinguish
the two different operations. Using our finest granularity,
AutoISES can disambiguate the two similar operations (the
unlink operation contains READ inode->i_size, but the link
operation does not). Our results also show that on average our
most fine-grained rules cause 33% fewer false positives than
the most coarse-grained rules, and detect all of the true
violations that the most coarse-grained rules can detect.
This indicates that rule granularity significantly affects
violation detection accuracy and could be considered a
tuning parameter for other rule inference and violation
detection tools to reduce false positives.

On the other hand, coarse-grained rules help us discover
high-level rules that are shared by different security
check functions, which fine-grained rules fail to uncover
(examples shown in Section 5.3). These results
show that different levels of granularity have
measurable advantages and disadvantages, and that designers
of rule inference and violation detection tools should
quantitatively evaluate the tradeoffs in order to choose
the most suitable granularity.


In summary, AutoISES closes an important gap in 
achieving secure software systems. To have truly secure 
software systems, not only must one have a secure de- 
sign, but the implementation must faithfully realize the 
design. To verify that the implementation faithfully real- 
izes the design, one must write a correct code-level spec- 
ification which can be verified by automatic tools such as 
a model checker or a static analyzer. AutoISES allows
the security specifications to be automatically extracted
from the actual implementation, relieving developers
of the burden of manually writing specifications
while at the same time significantly improving the accuracy
of the specification.


1.3 Paper Layout 


The remainder of the paper is organized as follows. Sec- 
tion 2 provides background information about MAC, 
DAC and the assumptions we make in our work. In Sec- 
tion 3, we present an overview of our approach, includ- 
ing some formal definitions and how we quantitatively 
evaluate rule granularity, followed by a detailed design 
in Section 4. Our methodology and experimental results 
are described in Section 5, and Section 6 discusses and 
summarizes our key techniques, their generalization and 
limitations. In Section 7 a discussion of the related work 
is presented, and finally we conclude with Section 8. 


2 Background and Assumptions 


2.1 DAC and MAC Background 


The traditional Linux kernel uses Discretionary Access 
Control (DAC), meaning the access control policies are 
set at the discretion of the owner of the objects. For example,
the root user typically sets the password file to be
writable only by herself. However, if the root user mistakenly
makes the password file publicly writable, then
the whole system is at risk. This example shows one major
deficiency of DAC: a mistake in an individual policy
decision can result in a breach of security for the
entire system. Mandatory Access Control (MAC) was proposed
to address this issue. In a MAC system, there exists
a system-wide security policy, such as “high-integrity
files must not be modified by low-integrity users”. Even
if the root user mistakenly grants write permissions on the
password file to everyone, when a normal user tries to
write the password file the attempt would fail because it
is against the system-wide policy. MAC is considerably
“safer” than DAC, but it is also more complex and more 
difficult to implement, especially for large systems like 
Linux [18]. It took Linux developers about two years to 
add MAC to the Linux kernel, and since then it has un- 
dergone many rounds of refinements and extensions. It 
is expected that its development will continue well into 
the future. 


2.2 Assumptions 


We make the following assumptions in our work. 

Reasonably mature code base: Similar to previous 
work on automatic rule extraction [3, 10, 11, 20, 22], we 
assume that the code we work with is reasonably ma- 
ture, i.e., it is mostly correct and already contains an im- 
plementation of the security architecture that is mostly 
working. This does not mean that software development 
ceases. In fact, the software might still be under active 
development and new features continue to be added. Al- 
most all open source and proprietary software falls into 
this category, therefore this assumption does not signifi- 
cantly limit the applicability of this work. 

Software developers not adversarial: We assume 
that software developers are trusted and will not deliberately
write code to defeat our rule generation mechanism.
This in general holds for the majority of the software
that exists today, i.e., we believe that the majority of software
developers intend to write correct and secure software.
In the limited cases where this assumption does
not hold [25], there exist static analysis techniques that 
can detect such malicious code [29]. However, detecting 
malicious code in general is challenging and remains an 
active open research problem. 

Kernel and hypervisor in the trusted computing 
base: For the two pieces of software that we experi- 
mented with, namely, the Linux kernel and the Xen hy- 
pervisor (virtual machine monitor), we assume that both 
are part of the trusted computing base. Thus, mandatory
access control is in place to prevent user-level or
guest-OS-level processes from breaking security policies.
Under this assumption, we consider it a breach of security
only if a user process or a guest OS process can bypass the
MAC mechanism (placed in the kernel or the hypervisor).
The kernel or the hypervisor is free
to modify the data structures on its own behalf (e.g., for 
bookkeeping purposes) without going through the secu- 
rity checks. This assumption is adopted from the MAC 
architecture of the Linux kernel and Xen hypervisor. 


3 Overview of Our Approach 


In this section, we will present our high level design 
choices of rule inference and violation detection; for- 
mal definitions of security rules, security sensitive oper- 
ations, and security violations; and how we explore dif- 
ferent levels of rule granularity. The detailed design will 
be discussed in the next section. 


3.1 Our approach 

Our approach consists of two steps, as shown in Figure 3. 
In the first step, we generate security specifications auto- 
matically from the source code. The input to the gen- 
erator is the source code and the set of security check 
functions. The output of this step is a set of security 
rules containing a security check function and a security 
sensitive operation represented by a group of data struc- 
ture accesses as shown in Figure 2 (a) (the advantages 
of using a group of data structure accesses to represent 
a sensitive operation are discussed in Section 3.2). In 
the second step, AutoISES takes the source code and 
the rules automatically inferred in the first step as the 
input, and outputs ranked security violations. Note that 
these automatically inferred rules can be used directly by 
AutoISES without manual examination, which reduces 
human involvement to its minimum. 


[Figure 3 diagram: source code and a list of security check functions are the inputs to Step 1 (automatically infer security specifications), which produces security specifications; these feed Step 2 (automatically detect violations to inferred security specifications), which outputs a ranked list of security violations.]


Figure 3: The analysis flow of AutoISES.


Step 1: Inferring Security Rules This paper focuses 
on inferring security rules which mandate that a secu- 
rity sensitive operation must be protected by a security 
check function, i.e., the sensitive operation must not be 
allowed to proceed if the security check fails. In order 
to effectively check such rules against the source code to
detect violations, it is crucial to specify the security rule
at the source code level. Unfortunately, in reality such 
rules are usually not documented. Therefore, our goal is 
to automatically infer such rules from the source code. 

To infer one rule, we want to discover, for the 
same security check function, what fixed security sen- 
sitive operation must be protected by it. We can in- 
fer this security rule from two angles: (1) we search 
for all instances of the same security check function 
(e.g., security_file_permission()) and discover
what sensitive operation is frequently protected by (e.g., 
performed after) the check function, or (2) we search 
for all instances of the same security sensitive operation 
(e.g., the 22 data structure accesses shown in Figure 2(a)) 
and then check what security check function is frequently 
used to protect (e.g., invoked before) the operation. We 
use the first method, because it is relatively easy to know 
what the security check functions are in the source code 
(usually documented), but knowing what security sen- 
sitive operations are in the source code itself is still a 
challenge (not documented). Specifically, we look for all 
instances of the same security check in the source code 
and collect sensitive operations protected by it. If this se- 
curity check is frequently used to protect a fixed sensitive 
operation, represented by a fixed set of data structure ac- 
cesses, we infer a security rule: this set of data structure 
accesses must be protected by this security check func- 
tion. Our rationale is that released software is mostly 
correct, so we can infer correct behavior from it. 

It is not uncommon that more than one security check 
function is required to protect one sensitive operation. In 
such cases, our inference approach still works because 
it will infer several separate rules, one for each security 
check function. The set of rules related to the same sensi- 
tive operation combined can detect violations where not 
all of the check functions are invoked to protect the op- 
eration. 

We can infer security rules statically or dynamically. 
While a dynamic approach is more precise, it has poorer 
coverage because only executed code is analyzed. As 
we study large software with millions of lines of code, a
dynamic approach may not be sufficient, as
confirmed by previous work [9, 13]. Therefore, we use
inter-procedural and flow-insensitive static program analysis
for rule inference. A more detailed description of our 
static analysis techniques can be found in Section 4.3. 

In summary, our tool AutoISES automatically infers 
sensitive operations in the form of a group of data struc- 
ture accesses that are commonly or frequently protected 
by the same security check function, given a list of secu- 
rity check functions. Similar to previous rule inference 
studies [3, 10, 11, 20, 22], we cannot discover all security
rules from the source code alone (discussed in Section 6.3).
However, it is effective at inferring some important
security rules from source code and detecting previously
unknown security vulnerabilities.


Step 2: Detecting Violations The goal of this step is to 
use the rules inferred above to detect security violations. 
Similarly, we use inter-procedural and flow-insensitive 
analysis for violation detection. As we already know 
which data structure accesses represent the security sen- 
sitive operation from an inferred rule, we can search for 
instances of the security sensitive operation that are not 
protected by the security check function, indicating se- 
curity violations. 

Specifically, AutoISES starts from each root function 
(automatically generated starting function for our anal- 
ysis and detection as discussed later in Section 4.2.1), 
and collects all data structure accesses and calls to 
security check functions. Then it calculates the 
accessViolationCount, which is the number of accesses
in the rule that are not protected by the particular
security check function. Specifically, if an
access in the rule is performed without being protected
by the security check function, AutoISES
increments the accessViolationCount by one. We then
use the accessViolationCount for violation ranking:
the higher the accessViolationCount is, the more
likely it is a true violation. We also allow tool
users to set a threshold and only report violations
whose accessViolationCount is higher than the threshold.
Users can always set the threshold to zero to see all
violation reports.
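The counting and ranking described above can be sketched as follows. This is a minimal illustration, not the AutoISES implementation: the string encoding of accesses, the function names, and the sample root functions and their access sets are all assumptions made for the example.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

# Hypothetical encoding: one access is a string such as "WRITE inode->i_size".
Access = str


def access_violation_count(rule_accesses: FrozenSet[Access],
                           unprotected_accesses: Set[Access]) -> int:
    """Number of accesses in the rule that are performed without the check."""
    return len(rule_accesses & unprotected_accesses)


def rank_violations(rule_accesses: FrozenSet[Access],
                    unprotected_by_root: Dict[str, Set[Access]],
                    threshold: int = 0) -> List[Tuple[str, int]]:
    """Report root functions whose count exceeds the threshold, highest first."""
    counted = [(root, access_violation_count(rule_accesses, accesses))
               for root, accesses in unprotected_by_root.items()]
    reported = [(root, count) for root, count in counted if count > threshold]
    # Sort by descending count; break ties by name for a stable order.
    return sorted(reported, key=lambda rc: (-rc[1], rc[0]))


# Illustrative rule and unprotected accesses per root function (made up).
rule = frozenset({"READ file->f_pos", "WRITE file->f_pos", "READ inode->i_size"})
unprotected = {
    "sys_readv": {"READ file->f_pos", "WRITE file->f_pos", "READ inode->i_size"},
    "sys_stat": {"READ inode->i_size"},
    "sys_read": set(),
}
ranking = rank_violations(rule, unprotected, threshold=0)
```

With a threshold of zero every root function that performs at least one unprotected access from the rule is reported; raising the threshold keeps only the reports that miss the check for more of the rule's accesses.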


Untrusted-space exposability analysis One key tech- 
nique we used to greatly reduce false positives is our 
untrusted-space exposability analysis. As we consider 
the kernel and the hypervisor to be our trusted comput- 
ing base, security sensitive operations in kernel space and 
hypervisor that do not interact with the untrusted space 
(user space or guest OS processes), do not need to be pro- 
tected by a security check function. On the other hand, 
if such sensitive operations interact with the untrusted 
space, e.g., are performed by a user space process via 
system calls, or use data copied from user space, then a 
security check may be mandatory. Since a large number of
sensitive operations are typically not exposed
to the untrusted space, most of the detected violations
would be false alarms without this distinction, which is
detrimental to a detection tool.

To reduce such false positives, we perform a simple
untrusted-space exposability study. Specifically, we compiled
a list of user space interface functions that are
known a priori to be exposed to user space, e.g.,
system calls such as sys_read() and hypercalls. Then,
AutoISES checks which sensitive operations are reachable
from these interface functions. If a sensitive operation
that can be exposed to the untrusted space is not
protected by the proper security check function, we report
the violation as an error; otherwise, we report the
violation as a warning. Our goal is to ensure most of the
errors are true violations, but we still generate the warnings
so that developers can examine them if they want
to. This approach relies on easy-to-obtain information
(system calls and hypercalls) to automatically reduce the
number of false positives.
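One way to picture the error/warning split is as a reachability query on the call graph. The sketch below is an assumption-laden illustration: the call graph, the function names in it, and the helper names `reachable_from` and `classify` are hypothetical and are not taken from AutoISES.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def reachable_from(call_graph: Dict[str, List[str]],
                   entry_points: Iterable[str]) -> Set[str]:
    """Breadth-first search over caller -> callee edges."""
    seen = set(entry_points)
    queue = deque(seen)
    while queue:
        fn = queue.popleft()
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


def classify(violating_functions: Iterable[str],
             call_graph: Dict[str, List[str]],
             interface_functions: Iterable[str]) -> Dict[str, str]:
    """Label a violation 'error' if it is reachable from the untrusted space."""
    exposed = reachable_from(call_graph, interface_functions)
    return {fn: ("error" if fn in exposed else "warning")
            for fn in violating_functions}


# Hypothetical call graph: one path starts at a system call, one does not.
call_graph = {
    "sys_readv": ["do_readv_writev"],
    "do_readv_writev": ["generic_file_readv"],
    "kernel_housekeeping": ["internal_update"],
}
labels = classify(["generic_file_readv", "internal_update"],
                  call_graph, ["sys_readv"])
```

Here the unchecked operation in `generic_file_readv` is reachable from a system call and is reported as an error, while the one in `internal_update` is only reachable from trusted code and is reported as a warning.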


3.2 Formal Definitions


Based on our reasoning above, we formally define the 
rule inferencing problem, security sensitive operations, 
security rules, our inference rule, and violations. 


Rule Inferencing Problem Given the target source
code and a set of n kernel security check functions,
$CheckSet = \{Check_1, \ldots, Check_n\}$, each of which can
check if a subject (e.g., a process) is authorized to perform
a certain security sensitive operation $Op_i$ (e.g.,
read, where $1 \le i \le n$) on a certain object (e.g., a file),
we want to uncover security specifications or security
rules, $Rule_i$, in the form of a pair, $(Check_i, Op_i)$, mandating
that a security sensitive operation $Op_i$ must be
protected, $<_{protected}$, by security check function $Check_i$
each time $Op_i$ is performed. Here protected means that
the operation $Op_i$ cannot be performed if the check
$Check_i$ fails.

A security check function $Check_i$ can be called
multiple times in the program, each of which is called
an instance of the security check function, denoted as
$InstanceOf(Check_i)_v$, where v is between 1 and the
total number of $Check_i$ instances inclusive. Similarly,
a security sensitive operation $Op_i$ can appear in the
program multiple times, each of which is called an
instance of the sensitive operation, $InstanceOf(Op_i)_u$.
If for all instances of the sensitive operation, there exists
at least one instance of the security check function to protect
it, then we say that the sensitive operation is protected
by the security check function. Formally speaking,
$\forall\, InstanceOf(Op_i)_u,\ \exists\, InstanceOf(Check_i)_v$
such that $InstanceOf(Op_i)_u <_{protected} InstanceOf(Check_i)_v$
$\Rightarrow Op_i <_{protected} Check_i$.





Representing Security Sensitive Operations There 
are several ways to represent security sensitive opera- 
tions at the code level. We can use a list of data structures 
that the operation manipulates, a list of functions the op- 
eration invokes, or the combination of the two. The list 
can be ordered or not ordered, indicating whether we re- 
quire these accesses to be performed in any particular 
order. 

We use data structure accesses to represent a security
sensitive operation, because this representation has two advantages
over using function calls. First, it can infer rules that
function call based analysis would not be able to find.
For example, if a sensitive operation is performed after
a check function via different function calls, e.g., A and
B, then by using functions A and B to represent the operation,
we may mistakenly conclude that nothing is commonly protected
by the check function and miss the rule. Zooming
into the functions allows us to find the shared
data structure accesses in both A and B. Second,
we can detect more violations by using data structure
accesses. For example, if we find that a check function
always protects function call A at many places, but
there is a violation that performs the same sensitive operation
via function B without invoking the check function
first, then we will not be able to detect the violation
unless we use the data structure accesses to represent
the sensitive operation in the rule. For instance,
security_file_permission() is used to protect
read, write, etc., in Linux 2.6.11, but the check is
missed when the sensitive operation is called through
readv (shown in Figure 2(b)), writev, aio_read, or
aio_write. Therefore, AutoISES would have missed
all of these violations if it had not used the actual data
structure accesses to represent the sensitive operation.

The tradeoffs between considering access orders or not 
are as follows. While preserving access orders is more 
precise, it has two major disadvantages. First, the order 
does not matter for certain rules, and preserving the order 
can cause one to miss the rule. For example, a directory
removal operation involves setting the inode’s size to 0
and decrementing the number of links to it by one. The order
in which the two accesses are performed is irrelevant. 
Second, it is more expensive to consider access orders, 
which can affect the scalability of our tool. On the other 
hand, the downside of not considering orders is that we 
can potentially have a higher number of false positives 
due to over-generalization. However, we did not find any 
false positives caused by this reason in this study. 

Therefore, we use a set of unordered data structure accesses,
$AccessSet = \{Access_1, \ldots, Access_m\}$, to represent
sensitive operation Op, where each data structure
access is defined as shown in Figure 4.


$Access_i$ := READ AST | WRITE AST | ACCESS AST

AST := typename(->field)* | global variable


Figure 4: Definition of one data structure access.
ACCESS AST means an access to AST (Abstract Syntax
Tree), either READ or WRITE.


Security Rules Replacing the security sensitive operation
$Op_i$ with AccessSet as defined above, we have the
following definition of security rules:


$Rule_i = (Check_i, AccessSet_i)$, where $Check_i \in CheckSet$

$\Rightarrow AccessSet_i <_{protected} Check_i$.


Inference Rule As such rules are usually undocumented,
we want to automatically infer them from source
code by observing what sensitive operation is frequently
protected by a security check function, i.e., what sensitive
operations are commonly protected by different instances
of the same security check function.

Formally speaking, we use the following inference
rule to infer security rules:


$AccessSet_i <_{frequently\ protected} Check_i$

$\Rightarrow InferredRule_i = (Check_i, AccessSet_i)$,
where $Check_i \in CheckSet$.


Violations Using such inferred rules, we want to detect
security violations. An instance of a security sensitive
operation, $InstanceOf(AccessSet_i)_u$, is a violation of
$InferredRule_i$ if it is not protected by any instance of
the security check function. In other words,


Given $InferredRule_i = (Check_i, AccessSet_i)$,
$\forall\, InstanceOf(Check_i)_v$,

$InstanceOf(AccessSet_i)_u \not<_{protected} InstanceOf(Check_i)_v$

$\Rightarrow InstanceOf(AccessSet_i)_u \in Violation$.


In this paper, we use rules and inferred rules inter- 
changeably. 


3.3 Exploring Rule Granularity


We explore 4 different levels of granularity based on two
criteria: whether to distinguish read and write access
types, and whether to distinguish structure fields. The
four levels of granularity are shown in Table 1.
For example, the access READ inode->i_size is
represented as READ inode for Granularity(F-, A+),
ACCESS inode->i_size for Granularity(F+, A-),
and ACCESS inode for Granularity(F-, A-).
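The mapping from a full-granularity access to the coarser levels can be written as a small projection function. This is an illustrative sketch only; the string encoding of an access and the name `project` are assumptions, not part of AutoISES.

```python
def project(access: str, distinguish_fields: bool,
            distinguish_access_types: bool) -> str:
    """Project an access such as 'READ inode->i_size' to a granularity level."""
    kind, ast = access.split(" ", 1)
    if not distinguish_fields:
        # Keep only the structure type name (or the global variable name).
        ast = ast.split("->", 1)[0]
    if not distinguish_access_types:
        # Collapse READ and WRITE into the generic ACCESS.
        kind = "ACCESS"
    return f"{kind} {ast}"


# (fields, access types) for F+A+, F-A+, F+A-, F-A-
levels = [project("READ inode->i_size", f, a)
          for f, a in [(True, True), (False, True), (True, False), (False, False)]]
```

The four resulting strings correspond, in order, to the four granularity levels used in the example above.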


Distinguishing Access Types | Distinguishing Structure Fields: Yes | Distinguishing Structure Fields: No
Yes | Granularity(F+, A+): READ inode->i_size | Granularity(F-, A+): READ inode
No | Granularity(F+, A-): ACCESS inode->i_size | Granularity(F-, A-): ACCESS inode

Table 1: Four Levels of Rule Granularity with Examples.

To better understand the impact of rule granularity
on rule inference and violation detection, and
to gain insight into how well our default granularity
(Granularity(F+, A+)) performs, we quantitatively
evaluate the 4 different levels of granularity on the Linux
kernel and Xen. This exploration is orthogonal to our
rule inference and violation detection, and can be applied
to previous rule inference techniques [9, 11, 20, 22, 30].


4 Detailed Design of AutoISES


4.1 A Naive Approach


We first describe a naive approach and show why it
does not work, which motivated us to explore alternatives.
A naive approach is to start the analysis from
the direct caller functions of a security check function,
and consider all data structure accesses performed after
the security check function as the protected sensitive
operation. This approach does not work because
it introduces obvious imprecision. For example, as
shown in Figure 5, security_inode_permission()
is called at the end of function permission().
If we start from function permission(), then no
data structures are accessed after security check
function security_inode_permission() in function
permission(), indicating that no data structure access
is protected by security_inode_permission(),
which is clearly not true. This naive approach fails because
permission() is not a function that actually uses
the check to protect security sensitive operations. Instead,
it is a wrapper function of the security check function.
The function that actually uses the security check
function for a permission check is vfs_link() shown in
the leftmost box of Figure 5.

To automatically infer security rules, we need to automatically
discover the functions (e.g., vfs_link()) that
actually use security checks for authorization checking.


4.2 Security Specification Extraction 


The goal of AutoISES is to discover the security sensi- 
tive operation, represented by a group of data structure 
accesses, that is protected by a security check function. 
Why we use data structure accesses to represent a se- 
curity sensitive operation has already been discussed in 
Section 1.2, and the two major advantages of this repre- 
sentation have been described in Section 3.2. To achieve 
this goal, we need to address four major challenges: (1) 
how to automatically discover functions that actually use 
security checks for authorization checking; (2) how to 
define “protected” at the code level; (3) what information
to extract; and (4) how to turn such information into
security rules.


4.2.1 How to find functions that actually use secu- 
rity checks for authorization checking? 


As shown above, simply starting the analysis from the 
direct callers of a security check function does not work. 
To automatically detect security rules, we need to auto- 
matically find the functions that actually use the check 
function to protect sensitive operations. However, which
functions actually use the check function for authorization
checking depends on the semantics of the software,
and thus is extremely difficult to extract automatically.


linux/fs/namei.c:
int vfs_link(...) {
    ...
    error = may_create(dir, ...);
    // data structure accesses
    ...
}

linux/fs/namei.c:
static int may_create(struct inode *dir, ...) {
    ...
    error = permission(dir, ...);
    ...
}

linux/fs/namei.c:
int permission(struct inode *inode, ...) {
    ...
    return security_inode_permission(inode, ...);
}
[Annotation: no code after security_inode_permission() in function permission().]


Figure 5: Demonstrating the naive approach does not work.


Instead, we try to automatically extract a good approximation
of these functions. Specifically, we (1) automatically
break the program into modules (e.g., each file system
is a module) based on the compilation configurations
that come with the software (e.g., in a Makefile), and (2)
consider the root functions of each module as functions
that actually use security check functions for authorization
checks, where root functions are functions that are
not called by any other function in the module. These
root functions can be automatically extracted by analyzing
the call graph of each module.
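Root-function extraction as defined above reduces to finding the functions of a module that never appear as a callee. The following is a minimal sketch under stated assumptions: the dictionary-based call graph, the helper name `root_functions`, and the simplified call chain are illustrative and do not reproduce the actual module contents.

```python
from typing import Dict, List


def root_functions(module_call_graph: Dict[str, List[str]]) -> List[str]:
    """Functions defined in the module that no function in the module calls.

    module_call_graph maps each function defined in the module to the
    functions it calls (callees outside the module may appear as values).
    """
    callees = {callee
               for callee_list in module_call_graph.values()
               for callee in callee_list}
    return sorted(fn for fn in module_call_graph if fn not in callees)


# Simplified, hypothetical call chain modeled on the Figure 5 example.
module = {
    "sys_link": ["vfs_link"],
    "vfs_link": ["may_create"],
    "may_create": ["permission"],
    "permission": ["security_inode_permission"],
}
roots = root_functions(module)
```

In this toy module only `sys_link` has no caller, so it is the single root function from which the analysis would start.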


Using this approach, AutoISES finds that
sys_link() is a root function for the ext2 file
system module. Although vfs_link() is the direct user
of the check, this approximation is good because the
root function sys_link() is the caller of vfs_link(),
and therefore the root function contains all the data structure
accesses vfs_link() performs. While it can also contain
accesses that are not in vfs_link(), which may not
be related to the security sensitive operation, this does not
affect the violation detection accuracy much in practice,
mainly for two reasons. First, since only accesses that
are protected by many instances of the same check
function are considered part of a sensitive operation,
many unrelated accesses can be automatically eliminated
during the rule generation stage (Section 4.2.4).
Second, during the violation detection stage, we
can set the threshold for accessViolationCount lower
to tolerate a few unrelated data structure accesses. Note
that these root functions are usually a superset of
our untrusted-space interface functions, as many root
functions can only be called by other kernel modules,
which are considered trusted. Therefore, our untrusted-space
exposability study is necessary for reducing false
positives.


An alternative solution is to ask developers or tool 
users to provide the functions that actually use the check 
functions. Although it is easier to provide such functions 
than writing the specifications directly, it is not desirable, 
because (1) it is not automatic; (2) one would need to 
manually identify such functions each time new code is 
added; and (3) manually identifying these functions can be
error-prone.


4.2.2 What does “protected” mean at the code level?


An instance of a sensitive operation $Op_i$ is considered
protected by an instance of a security check function
$Check_i$ if the operation is allowed only if it is authorized,
as indicated by the return value of the check function.
To implement these exact semantics, we need to know
the semantics of the return values of all the security check
functions, which requires significant manual work and
does not scale. Therefore, we use
a source code level approximation of these semantics: a
security check function protects all data structure accesses
that appear “after” the security check function in
an execution trace. Although this approximation can include
some unrelated accesses, it is reasonably accurate
and effective at helping detect violations for the same
two reasons discussed in Section 4.2.1. Additionally, the
approximation makes our approach more automatic and
general, because we do not require developers to provide
the semantics of the return values of the security check
functions. Similar to previous static analysis techniques,
our static analysis does not employ any dynamic execution
information. Instead, the execution trace we use is a
static approximation of the dynamic execution trace.
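The "after the check" approximation can be illustrated on a linearized static trace. This sketch assumes a hypothetical trace encoding (a list of `("call", name)` and `("access", description)` events) and hypothetical field names; how AutoISES actually builds and represents its traces is not shown here.

```python
from typing import List, Set, Tuple

Event = Tuple[str, str]  # ("call", function name) or ("access", access string)


def protected_accesses(trace: List[Event], check_name: str) -> Set[str]:
    """Accesses that appear after the first call to check_name in the trace."""
    protected: Set[str] = set()
    seen_check = False
    for kind, value in trace:
        if kind == "call" and value == check_name:
            seen_check = True
        elif kind == "access" and seen_check:
            protected.add(value)
    return protected


# Hypothetical static trace starting at a root function.
trace = [
    ("access", "READ dentry->d_inode"),
    ("call", "security_inode_permission"),
    ("access", "WRITE inode->i_nlink"),
    ("access", "WRITE inode->i_ctime"),
]
after_check = protected_accesses(trace, "security_inode_permission")
```

Accesses that precede the check are not attributed to it, and a trace that never calls the check yields an empty protected set.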


4.2.3 What information to extract?


We want to extract data structure accesses that are fre- 
quently protected by a security check function. Since a 
typical program accesses a large number of data struc- 
tures, many of which are irrelevant to the security sensi- 
tive operation, we need to collect the most relevant ac- 
cesses and exclude noise. For example, a loop iterator 
is not interesting for our rule extraction, so we want to 
exclude it. Although all data structure accesses theoreti- 
cally can be protected by a security check function, struc- 
ture field accesses and global variable accesses are more 
commonly protected than short-lived local scalar vari- 
ables. Therefore, we extract all structure field accesses 
and global variable accesses. In addition, a security sen- 
sitive operation, being an aggregate representation of its 
specific instances, is naturally represented by accesses to 
the types of data structures, and not by accesses to specific
data objects. Thus, our rule inference engine considers
structure types as opposed to actual objects.


4.2.4 How to infer rules?


Starting from the automatically identified root functions, 
we can extract the data structure access set for each in- 
stance of a security check function. To obtain the data 
structure access set protected by the security check func- 
tion, we simply compute the intersection of all of these 
access sets. Since our static analysis can miss some data 
structure accesses for some root functions due to analysis 
imprecision, we do not require accesses to be protected 
by all instances. Instead, if intersecting an access set re- 
sults in an empty set, we drop this access set because it 
is likely to be an incomplete set. As long as there are 
enough security check instances protecting the accesses, 
we are confident the accesses are security sensitive and 
the inferred rule is valid. 
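The intersection strategy, including the step that drops an access set when it would empty the intersection, might look like the sketch below. It is a simplification under assumptions: the `min_instances` parameter stands in for "enough security check instances" (the text gives no concrete number), the result depends on the order in which sets are visited, and the access strings are made up.

```python
from typing import FrozenSet, Iterable, Optional, Set


def infer_rule(access_sets: Iterable[Set[str]],
               min_instances: int = 2) -> Optional[FrozenSet[str]]:
    """Intersect per-instance access sets, skipping sets that empty the result."""
    rule: Optional[Set[str]] = None
    support = 0  # number of check instances that contributed to the rule
    for accesses in access_sets:
        if rule is None:
            rule, support = set(accesses), 1
            continue
        common = rule & set(accesses)
        if not common:
            # Likely an incomplete access set; drop it instead of losing the rule.
            continue
        rule, support = common, support + 1
    if rule is None or support < min_instances:
        return None
    return frozenset(rule)


# Hypothetical access sets collected after one check in four root functions.
sets = [
    {"WRITE inode->i_size", "WRITE inode->i_nlink", "READ dentry->d_name"},
    {"WRITE inode->i_size", "WRITE inode->i_nlink", "WRITE inode->i_mtime"},
    {"READ sb->s_flags"},
    {"WRITE inode->i_size", "WRITE inode->i_nlink"},
]
inferred = infer_rule(sets)
```

The third set shares nothing with the running intersection and is dropped, so the inferred access set is the pair of writes common to the other three instances.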

However, unlike when inferring general program
rules, many security check functions are called only once
or twice, which makes it difficult for the intersection
strategy to be effective. We observed that many such
functions are only called once or twice because Linux
uses a centralized place to invoke such checks for different
implementations. For example, check function
security_inode_rmdir() is only called once at the
virtual file system level, but it actually protects the sensitive
rmdir operation of many different file systems.
Therefore, semantically the check function is invoked
once for each file system. Thus, we can intersect the
rmdir operations of different file systems to obtain the essential
protected sensitive accesses. This strategy makes
it possible for AutoISES to automatically generate rules
of reasonably small sizes with high confidence even for
check functions that are called only a few times. This
is realized by performing a function alias analysis and
generating a separate static trace for each function alias,
essentially treating each function alias as if it were a separate
function call.


4.3 Our Static Analysis 


We use inter-procedural and flow-insensitive static pro- 
gram analysis to infer security rules and detect viola- 
tions. It is important to use inter-procedural analysis, 
because many sensitive data structure accesses related to 
the same sensitive operation are performed in different 
functions. In fact, these accesses can be many (e.g., 18) 
levels apart in the call chain, meaning the caller of one 
access can be the 18th ancestor caller of another access. 
An intra-procedural analysis would not adequately capture
the security rules or be effective at detecting violations.
In fact, without our inter-procedural analysis, we
would be able to detect almost none of the violations.
For higher accuracy, we perform full inter-procedural 
analysis, which means that we allow our analysis tool 
to zoom into functions as deep as it can, i.e., until it has 


17th USENIX Security Symposium 


analyzed all reachable functions whose source code is 
available. We chose to use flow-insensitive analysis over 
flow-sensitive analysis because it is less expensive and 
scales better for large software. 

As function pointers are widely used in Linux and 
Xen, we perform simple function pointer analysis by 
resolving a function pointer to functions with the same 
type. Our analysis is conservative in the absence of type 
cast. 
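Type-based function pointer resolution can be sketched as a lookup of all functions whose signature equals the pointer's type. The signature strings, the function table, and the helper name below are illustrative assumptions; a real analysis would compare types from the compiler front end rather than strings.

```python
from typing import Dict, List


def resolve_function_pointer(pointer_type: str,
                             functions: Dict[str, str]) -> List[str]:
    """Return every function whose signature matches the pointer's type."""
    return sorted(name for name, signature in functions.items()
                  if signature == pointer_type)


# Illustrative function table: name -> signature string.
functions = {
    "ext2_rmdir": "int (struct inode *, struct dentry *)",
    "ext3_rmdir": "int (struct inode *, struct dentry *)",
    "ext2_sync_file": "int (struct file *, struct dentry *, int)",
}
targets = resolve_function_pointer("int (struct inode *, struct dentry *)",
                                   functions)
```

Every function with a matching type is treated as a possible target, which over-approximates the real call targets but does not miss any as long as no type casts are involved.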


5 Methodology and Results 


We evaluated our tool on the latest versions, at the time
of writing, of two large open source software systems, Linux and
Xen. Table 2 lists their size information.


Lines of Code | Total # of Check Functions 


[Tix [SOM [OG 


Table 2: Evaluated software. We excluded constructor and de- 
structor type of security check functions from the list, because 
they are not authorization checks. 





Table 3 shows our overall analysis and detection results.
AutoISES automatically generated 84 code-level
security rules, which served as the concrete security
specifications of the two software systems we studied. These
specifications are critical for verifying software correctness
and security. Additionally, they can help developers
better understand the code and ease the task of software 
maintenance. We did not generate one rule for each secu- 
rity check mainly because some parts of the source code 
were not compiled for the default Linux kernel or Xen 
configuration, and were therefore not analyzed. 

Based on our untrusted-space exposability study re- 
sults, AutoISES reports violations that can be exposed 
to untrusted space as errors, and the others as warnings 
since they are less likely to be true security violations. 
Using the 84 automatically generated rules, AutoISES 
reported 8 error reports and 293 warning reports. A total 
of 8 true violations were found, 6 of which were from the 
error reports, and 2 were from the warning reports (only
the top warnings were examined). Among the 8 true vi- 
olations, 7 of them have been confirmed by the corre- 
sponding developers. All of the automatically inferred 
rules were used by the AutoISES checker directly with- 
out being examined by us or the developers. If higher de- 
tection accuracy is desired, developers or tool users can 
examine rules before using them for violation detection. 

These results demonstrate that AutoISES is effective
at automatically inferring security rules and detecting
violations of these rules, which closes an important gap
in achieving secure systems and greatly improves the
practicality of security property checking and verification
tools.
Software | # of Rules | # of True Violations | False Positives in Errors | # of Warnings | # Inspected

[The per-system rows of Table 3 are garbled in the source and not recoverable.]


Table 3: Overall results of AutoISES. Numbers in parentheses are true violations in warning reports. 








(a) linux/fs/sys_splic.c:
    static long do_splice_from(..., struct file *out, ...)
        ...
        [Security check security_file_permission() is missing here,
         before the splice write]
        return out->f_op->splice_write(...);

(b) linux/fs/sys_splic.c:
    static long do_splice_to(struct file *in, ...)
        ...
        [Security check security_file_permission() is missing here,
         before the splice read]
        return in->f_op->splice_read(...);

(c) linux/net/decnet/netfilter/dn_rtmsg.c:
    static inline void dnrmg_receive_user_skb(...)
        ...
        [Security check security_netlink_recv() should be used
         instead of cap_raised()]
        if (!cap_raised(...))
            RCV_SKB_FAIL(-EPERM); ...

Figure 6: True violations AutoISES automatically detected in the latest versions of Linux kernel. All of these violations have
already been confirmed by the Linux developers.


5.1 Detected Violations 


We manually examined every error report and only the 
top warning reports (due to time constraints) to deter- 
mine if a report is a true violation or a false positive. 


5.1.1 True Violations 


There are two types of true violations, exploitable viola- 
tions and consistency violations. 


Exploitable Violations Among the 8 true violations,
5 are exploitable violations. Figure 6 (a) and (b) show
two exploitable violations. In Linux 2.6.21.5, the security
check security_file_permission() was missing
before the file splice read and file splice write
operations. Without the check, an unauthorized user could
splice data from a pipe to a file and vice versa, which could
cause permanent data loss, information leaks, etc. This
violation has already been confirmed by the Linux
developers.


Consistency Violations We term the 3 remaining true 
violations Consistency Violations, meaning that although 
they may not be exploitable, they violate the consistency 
of using security check functions. Such inconsistencies 
can confuse developers and make the software difficult 
to maintain, both of which can contribute to more errors 
in the future. Therefore, it is important for developers to 
fix consistency violations. 

Figure 6 (c) shows an example of a consistency violation.
The security check security_netlink_recv(),
which checks permission before processing
the received netlink message, was missing in
dnrmg_receive_user_skb(), which receives and
processes netlink messages. This error could cause
the kernel to receive messages from unauthorized
users. However, dnrmg_receive_user_skb()
did call function cap_raised(), which is what
security_netlink_recv() calls eventually. In other
words, it bypasses the security check interface functions
and calls the backend security policy functions directly,
which is a bad practice and should be avoided.

At the time of writing, 2 out of the 3 consistency vi- 
olations, including the example shown above, have been 
confirmed and fixed by the corresponding developers. 


5.1.2 False Positives 


The false positive rate in error reports is 2 out of 8. There 
are more false positives in the warning reports because 
no untrusted-space exposability analysis is performed on 
the warning reports. Developers can choose to focus on 
the error reports to save time, or also examine the warn- 
ings if desired. 

Several factors can contribute to false positives. First, 
as we use conservative function pointer alias analysis, we 
can mistakenly consider accesses not related to an oper- 
ation as part of the operation, and generate an imprecise 
rule. These extra accesses do not need to be protected by
the security check, but our tool may still report such false 
violations. A static analysis tool with more advanced 
function pointer alias analysis could reduce such false 
positives. 
Figure 7: A false positive detected by AutoISES in Linux
kernel 2.6.21.5. Only related functions are shown.


Additionally, certain semantics of the target code
make some of the detected errors not exploitable.
Figure 7 shows such an example, where an implicit
temporal constraint on certain system calls
allows the coverage of a security check to span
multiple system calls. AutoISES reported that the
security check security_file_permission()
should be called before aio_run_iocb(), but in
the call chain in Figure 7(1), starting from the system
call function sys_io_getevents(), the check
security_file_permission() was missing. However,
this is not an exploitable violation, because system
call sys_io_getevents() cannot be called without
system call sys_io_submit() being invoked first,
which consults the proper security check in its callee
function aio_setup_iocb() in call chain (2). Because
AutoISES did not know this restriction on using the
system calls, it reported the violation. However, if
the file permission is changed after the setup system
call sys_io_submit() and before the invocation of
sys_io_getevents(), then unauthorized accesses can
occur. Linux developers confirmed the potential for such
violations, but are unlikely to fix them because the current
Linux implementation does not enforce protection
against this type of violation.

There are at least two ways to reduce or eliminate false 
positives. First, we can employ more accurate static anal- 
ysis techniques. Additionally, as increasing granular- 
ity could reduce false positives (discussed later in Sec- 
tion 5.3), we can experiment with even finer granularity, 
such as distinguishing increment, decrement, and zero- 
ing operations, to further reduce false positives. 


5.2 Parameter Sensitivity and Time Overhead


By default, we set the threshold of
accessViolationCount to be 50% of the rule size,
which is the total number of data structure accesses in
a rule. We found that for Linux, the detection results
are not very sensitive to this parameter, meaning that
most true violations perform all or almost all of the data
structure accesses, and false violations often perform
none or only a few of the data structure accesses.
These results show that the generated rules capture the
implicit security rules well, and that these rules are effective
in helping detect violations of them. For Xen, the
results are more sensitive to the threshold. A possible
explanation is that Xen security checks are in general
called fewer times than those in the Linux kernel, so
there are fewer instances from which AutoISES can learn
precise rules. As a result, the inferred Xen rules contain more
noisy accesses that do not need to be protected by the
check functions. In this case, we set the threshold
higher, to 90%, to minimize the impact of noisy accesses.
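
The role of the threshold can be illustrated with a small sketch. It assumes, as a simplification of the definition given earlier in the paper, that a function is reported when the number of rule accesses it performs without the check reaches the given fraction of the rule size. The function name `violates` and the sample access strings are hypothetical.

```python
def violates(rule_accesses, unchecked_accesses, threshold=0.5):
    """Return True if enough of the rule's accesses occur without the check.

    rule_accesses:      set of data structure accesses in the inferred rule
    unchecked_accesses: set of accesses a function performs with no check
    threshold:          fraction of the rule size that must be matched
                        (assumed semantics of accessViolationCount)
    """
    if not rule_accesses:
        return False
    matched = len(rule_accesses & unchecked_accesses)
    return matched >= threshold * len(rule_accesses)


# A four-access rule, using the accesses listed in Figure 8(b).
rule = {
    "READ inode->i_ino",
    "READ inode->i_nlink",
    "WRITE inode->i_nlink",
    "READ inode->i_sb",
}

# Hypothetical candidates: one performs 3 of the 4 accesses, one only 1.
mostly_matching = {"READ inode->i_ino", "READ inode->i_nlink", "WRITE inode->i_nlink"}
barely_matching = {"READ inode->i_sb"}

reported_at_50 = violates(rule, mostly_matching, threshold=0.5)
reported_at_90 = violates(rule, mostly_matching, threshold=0.9)
noise_reported = violates(rule, barely_matching, threshold=0.5)
```

Raising the threshold to 90% suppresses the partial match, which mirrors the reason given above for using a higher value when rules contain noisy accesses.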

AutoISES spent 86 minutes inferring 51 rules from
the entire Linux kernel, and 116 minutes using these
rules to check the entire Linux kernel for violations.
As the code size of Xen is much smaller, rule generation
for Xen took 25 seconds and detection took 39 minutes.
This shows that our tool is efficient enough to
be used in practice for large real-world software.


5.3 Impact of Rule Granularity


In many cases, a coarse-grained rule is overly
generalized and thus does not precisely represent
the implicit security rules. For example,
two different checks, security_inode_link() and
security_inode_unlink(), are designed to protect two
different inode operations. However, as shown in Figure 8(a),
the inferred operations of Granularity(F-, A+)
are the same. Using finer granularity, Granularity(F+,
A+), AutoISES is able to automatically infer two different
operations (Figure 8(b)-(c)). For example, the unlink
operation contains the access READ inode->i_size,
which is not part of the link operation.
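
The effect of granularity on the link/unlink example can be sketched as below. The sketch assumes that F+ means structure fields are distinguished and A+ means reads and writes are distinguished, which is how Figure 8 presents the rules. The helper names `project` and `rule` are hypothetical, and only the inode accesses from Figure 8(b)-(c) are modeled.

```python
def project(access, fields, access_type):
    """Project one (kind, struct, field) access onto a granularity level.

    fields=True keeps the structure field (F+); False drops it (F-).
    access_type=True keeps READ/WRITE (A+); False merges them (A-).
    """
    kind, struct, field = access
    target = f"{struct}->{field}" if fields else struct
    return (kind if access_type else "ACCESS", target)


def rule(accesses, fields, access_type):
    """The set of distinct accesses seen at the chosen granularity."""
    return frozenset(project(a, fields, access_type) for a in accesses)


LINK = [
    ("READ", "inode", "i_ino"),
    ("READ", "inode", "i_nlink"),
    ("WRITE", "inode", "i_nlink"),
    ("READ", "inode", "i_sb"),
]
UNLINK = LINK + [("READ", "inode", "i_size")]

# Granularity(F-, A+): both operations collapse to the same rule.
same_when_coarse = rule(LINK, False, True) == rule(UNLINK, False, True)

# Granularity(F+, A+): the unlink rule has one extra access.
extra_in_unlink = rule(UNLINK, True, True) - rule(LINK, True, True)
```

At the coarse level both rules reduce to reading and writing `inode`, while the fine level isolates the `i_size` read as the distinguishing access.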

Fine-grained rules cause fewer false positives during
the detection stage. For 5 randomly selected security
checks, compared with the most coarse-grained
rules (Granularity(F-, A-)), our most fine-grained rules
(Granularity(F+, A+)) on average cause 33% fewer
false positives (in both error reports and warning
reports). Granularity(F+, A-) causes 20% fewer false
positives, and Granularity(F-, A+) 13.3% fewer. The
results show that using finer granularity can greatly reduce
the number of false positives, and that rule granularity
could be considered an important tuning
parameter for other rule inference and violation detection
tools [9, 11, 20, 22, 30].

Although coarse-grained rules produce a higher
false positive rate, they can provide very useful information
that fine-grained rules may fail to unveil. In
the example above, the operation of Granularity(F-,
A+) is shared by almost all inode related security
(a) Rule of Granularity(F-, A+), for security_inode_link and security_inode_unlink

Security Check: security_inode_link / security_inode_unlink
Protected operation:
1. READ dentry
2. READ inode
3. WRITE inode
4. READ nameidata
5. WRITE vfsmount
6. READ (Global) names_cachep

(b) Rule of Granularity(F+, A+), for security_inode_link

Security Check: security_inode_link
Protected operation:
1. READ inode->i_ino
2. READ inode->i_nlink
3. WRITE inode->i_nlink
4. READ inode->i_sb

(c) Rule of Granularity(F+, A+), for security_inode_unlink

Security Check: security_inode_unlink
Protected operation:
1. READ inode->i_size
2. READ inode->i_ino
3. READ inode->i_nlink
4. WRITE inode->i_nlink
5. READ inode->i_sb

Figure 8: For two security checks, security_inode_link() and security_inode_unlink(), the inferred operations of
Granularity(F-, A+) are the same. If we use Granularity(F+, A+), the learned operations are different, e.g., the unlink operation
contains READ inode->i_size.
[...] general. A more fine-grained rule may fail to reveal the
common behavior among all inode and file operations.

In addition, a fine-grained rule can be overly specific
and cause false negatives. We did not observe such cases
for our most fine-grained rules in this study, i.e., our most
fine-grained rules were able to detect all of the true violations.
This result indicates that the default granularity we
use is the best among the 4 levels of granularity in terms
of detection accuracy, as it produces the fewest
false positives and the same number of false negatives
as the coarse-grained rules. In the future, we plan
to experiment with even finer granularity and study its impact
on both false positives and false negatives.

Results from different levels of granularity can be used 
as a metric for violation ranking. For example, a viola- 
tion that is reported at all levels of granularity is probably 
more likely to be a true violation than one that is reported 
only at some levels. In our future work, we will explore 
using the number of granularity levels a violation occurs 
at to rank violations. 
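
One possible form of the ranking idea is sketched below: count, for each violation, the number of granularity levels at which it is reported, and sort by that count. This is a guess at how such a ranking could work, since the paper leaves it to future work. The function name, the level labels, and the violation identifiers are all hypothetical.

```python
from collections import Counter


def rank_by_granularity_support(reports_by_level):
    """Rank violations by how many granularity levels report them.

    reports_by_level maps a level label to the violation ids reported
    at that level. Ties are broken by id so the order is deterministic.
    """
    counts = Counter()
    for violations in reports_by_level.values():
        counts.update(set(violations))  # count each violation once per level
    return sorted(counts, key=lambda v: (-counts[v], v))


# Made-up reports for the four levels of granularity.
reports = {
    "F-,A-": ["v1", "v2", "v3"],
    "F-,A+": ["v1", "v2"],
    "F+,A-": ["v1", "v3"],
    "F+,A+": ["v1"],
}
ranking = rank_by_granularity_support(reports)
```

Here `v1` is reported at all four levels and would be inspected first.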


6 Discussions and Limitations 


6.1 Key Techniques that Make AutoISES Work


Automatically generating security specifications poses 
several key challenges that make previous static analy- 
sis tools not directly applicable. We designed five im- 
portant techniques (first four are new) to address these 
challenges as summarized below (Sec. column lists cor- 
responding sections that describe the techniques): 
Challenge | Technique | Sec.
Root functions for analysis: Cannot simply start analysis from direct callers of a security check function | Automatic root function discovery: Automatically discover functions that actually use security check functions for authorization check | 4.2.1
Insufficient invocation instances of security check functions | Leverage different implementations (e.g., from different file systems) of the same operation | 4.2.4
Data structure accesses are spread in different functions | Interprocedural analysis with function pointer analysis | 4.3














6.2 Generalization 


Although many of the solutions described above are de- 
signed for inferring security specifications and detecting 
security violations, some of the ideas are general, and 
can be applied to other applications. For example, our 
security rules are an important type of function-data cor- 
relation. Such function-data correlations widely exist in 
programs. Violating these implicit constraints results in 
buggy programs that may cause severe damage. Our 
techniques can be used to infer those general function- 
data correlations, e.g., a lock acquisition function re- 
quired before accessing shared data structures, which can 
be used for detecting concurrency bugs. In addition, the 
strategy of using multiple implementations of the same 
virtual API to generate more precise rules is generally 
applicable to situations where source code at the virtual 
API level is not sufficient to generate reliable rules. 
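
As a rough illustration of a function-data correlation outside the security setting, the sketch below infers "lock L is held when variable V is accessed" rules from observed accesses and then flags accesses that break an inferred rule. It uses a simple support count rather than the AutoISES algorithm, and every name in it (`infer_lock_rules`, `find_violations`, the sample functions and lock) is hypothetical.

```python
from collections import Counter, defaultdict


def infer_lock_rules(observations, min_support=0.75):
    """Infer variable -> lock rules from (function, variable, locks_held) tuples.

    A rule is emitted when the most commonly held lock is held in at
    least min_support of the accesses to that variable.
    """
    total = Counter()
    held = defaultdict(Counter)
    for _function, variable, locks in observations:
        total[variable] += 1
        for lock in locks:
            held[variable][lock] += 1
    rules = {}
    for variable, n in total.items():
        if held[variable]:
            lock, count = held[variable].most_common(1)[0]
            if count / n >= min_support:
                rules[variable] = lock
    return rules


def find_violations(observations, rules):
    """Accesses to a ruled variable made without the expected lock."""
    return [
        (function, variable)
        for function, variable, locks in observations
        if variable in rules and rules[variable] not in locks
    ]


# Made-up observations: four accesses hold the lock, one does not.
observed = [
    ("inc", "counter", frozenset({"counter_lock"})),
    ("dec", "counter", frozenset({"counter_lock"})),
    ("reset", "counter", frozenset({"counter_lock"})),
    ("read", "counter", frozenset({"counter_lock"})),
    ("debug_dump", "counter", frozenset()),
]
lock_rules = infer_lock_rules(observed)
suspects = find_violations(observed, lock_rules)
```

The unlocked access in `debug_dump` is the kind of candidate concurrency bug such a rule would surface.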


6.3 Limitations 


False Negatives Similar to previous static analysis
work, our approach can miss security violations. First,
if a security check function is not invoked at all (e.g.,
security_sk_classify_flow is not used in Linux
2.6.11 yet) or the list of security check functions is
incomplete, we would not be able to infer rules or detect
violations related to these missing check functions.

Additionally, our analysis uses only data structure ac- 
cesses to represent a security operation. Therefore, if the 
source code of such low-level accesses is not available, 
AutoISES will not be able to extract information about 
them, and the representation of the sensitive operation 
would be incomplete, potentially causing false negatives. 

Moreover, AutoISES does not verify whether the security
check is performed on the same object as the sensitive
operation. Therefore, if the proper security check is
invoked, but on a different object, AutoISES will not
detect this violation. Matching the actual object remains
future work. Additionally, our flow-insensitive
analysis could introduce false negatives. For example, if a
security check is missing on a taken branch, but the check
is invoked on the non-taken branch, AutoISES may not
be able to detect the violation. Using a flow-sensitive
analysis could address this problem.
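
The branch example can be made concrete with a toy model in which a function is a list of execution paths and each path is a list of events. This is a simplification for illustration only: the two checkers and the event names are hypothetical, and the second checker enumerates paths, which is just one way to recover ordering information.

```python
def flow_insensitive_ok(paths, check, op):
    """Accept if the check appears anywhere in a function that contains op."""
    events = {event for path in paths for event in path}
    return op not in events or check in events


def path_sensitive_ok(paths, check, op):
    """Accept only if, on every path, op is preceded by the check."""
    for path in paths:
        seen_check = False
        for event in path:
            if event == check:
                seen_check = True
            elif event == op and not seen_check:
                return False
    return True


# One branch performs the operation unchecked; the other only checks.
function_paths = [
    ["sensitive_op"],
    ["security_check"],
]
missed = flow_insensitive_ok(function_paths, "security_check", "sensitive_op")
caught = not path_sensitive_ok(function_paths, "security_check", "sensitive_op")
```

The flow-insensitive view sees both the check and the operation in the same function and accepts it, while the per-path view rejects the unchecked branch.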


Difficulty in Verifying Violations We manually examine
error reports and warning reports to determine whether a
report is a true violation or a false positive. Unlike
errors such as buffer overflows and null pointer dereferences,
which are usually easy to confirm after the error
is detected, the manual verification process for security
violations is more difficult. To decide whether a violation is
exploitable, one needs to understand the semantics of the
code, know what operations can interact with the untrusted
space (such as the user space for Linux), and
devise a feasible way to exploit the violation. Conversely, to
determine whether a violation is a false positive, one needs to
prove either that the operation is security insensitive, or
that it is indeed covered by a security check that was not
included due to analysis imprecision. Sometimes this
requires deep knowledge of not only the target software,
but also how the APIs are used by client software (e.g.,
the example discussed in Section 5.1.2). Such difficulties
are mostly due to the inherent characteristics of security
violations. However, we imagine that the task would be
much easier for the original developers, as they possess
deep semantic knowledge of the code.


Non-authorization Checks A small number of security
checks are not authorization checks and do
not protect any security operations. For example,
security_sk_free() should be called after using a
kernel sk buffer to clear sensitive data. Our current
implementation does not support such rules, where a
security check function must be invoked after a certain
operation. However, such rules can be easily supported by
extending our current implementation to include the
post-operation checks.


7 Related Work 


Mining Security Sensitive Operations Ganapathy et 
al. used concept analysis to find fingerprints of security 
sensitive operations [15]. While both this approach and 
AutoISES try to map the high level security sensitive 
operation (e.g., rmdir) to its implementation (e.g., the C 
code sequences that actually perform the remove direc- 
tory operation), there are two major differences. First, 
the goals and assumptions are different. We aim to iden- 
tify the pairing relationship between a security check and 
the code level representation of the sensitive operation 
that the check guards. Thus we assume the code already 
implements a reference monitor and is mostly correct; 
our goal is therefore to discover cases where the refer- 
ence monitor is bypassed. Ganapathy’s goal, on the other 
hand, is to retrofit code with security. Thus they assume 
that the code does not have security built in. Rather, they 
need to identify sequences of code that represent a unit of 
security sensitive operation and that should be guarded 
by a security hook. In order to do that they need more 
prior knowledge with regard to the API and the secu- 
rity sensitive data structures. In our case, all informa- 
tion except the list of security check functions, and the 
list of system call functions and hypercall functions, is 
inferred from the code itself. Second, while our inferred 
operations are used directly by our checker without being 
examined manually, their operations still require manual 
refinement prior to use. 

Although automatic hook placement is promising, it 
has not been adopted in reality yet. Therefore, while we 
should encourage automatic hook placement, it is still 
highly desirable to seek alternative, complementary so- 
lutions that can automatically infer security rules from 
existing or legacy source code and detect security vul- 
nerabilities. 


Detection and Verification Tools The past years have
seen a proliferation of program analysis and verification
tools that can be used to detect security vulnerabilities or
verify security properties [2, 4, 5, 6, 9, 12, 14, 16, 18, 27,
30]. However, no previous work can automatically generate
code-level security specifications; instead, these tools
require developers or users to provide the specifications.
Previous work [30] takes manually identified simple
security rules to check for security vulnerabilities. As
discussed in detail in Section 1, the rules are coarse and
imprecise, resulting in many false alarms. Additionally,
the approach can potentially fail to detect cases where the
check and the operation do not match, because the rules
do not specify which check is required for which operation.
Edwards et al. dynamically detect inconsistencies
between data structure accesses to identify security
vulnerabilities [9]. While a dynamic approach is generally
more accurate, it suffers from a coverage problem: only
code that is executed can be analyzed. In addition, it
requires manually written filtering rules to guide the trace
analysis in order to detect security violations.


Inferring Programming Rules Several techniques 
have been proposed to infer different types of program- 
ming rules from source code or execution trace [3, 10, 
11, 20, 22, 24]. As already discussed in Section 1, previous
techniques are not directly applicable to our problem,
because they are limited by the types of rules they
can infer. Specifically, Engler et al. extract programming 
rules based on several manually identified rule templates, 
such as function (A) and (B) should be paired, func- 
tion (F) must be checked for failure, and null pointer 
(P) should not be dereferenced [10]. PR-Miner fo- 
cuses on inferring correlations among functions [20]. 
Variable value related program invariants are inferred 
by Daikon [11], and MUVI [22] infers variable-variable
correlations for detecting multi-variable inconsistent update
bugs and multi-variable concurrency bugs. A few
other approaches infer API and/or abstract data type related
rules [3, 24]. Different from all these studies, we infer
rules related to security functions protecting a group
of data structure accesses based on our key observa- 
tion. Inferring different types of rules requires differ- 
ent techniques. In addition, dynamic analysis is used 
in [3, 11], therefore the coverage is limited because only 
instrumented and executed code is used for rule learning. 
Moreover, unlike PR-Miner which uses only intraproce- 
dural analysis, our analysis is interprocedural, which is 
one of the key techniques that allow us to infer com- 
plicated and detailed security rules. Additionally, while 
PR-Miner uses more complex data mining techniques 
to infer programming rules, we leverage readily avail- 
able prior knowledge about part of our rules, the secu- 
rity check functions, so that we can extract security rules 
without expensive data mining techniques. 


Inferring Models and Rules in General The general 
idea of automatically extracting models from low-level 
implementation has been discussed in previous litera- 
ture [8, 17, 21]. For example, Lie et al. proposed au- 
tomatic extraction of specifications from actual proto- 
col code and then running the extracted specifications 
through a model checker [21]. While conceptually these 
approaches bear some resemblance to the approach taken 
by AutoISES, we are the first to show the feasibility of 
automatic extraction of security specifications from ac- 
tual implementation. In addition, none of the previous 
tools have demonstrated the ability to scale to programs 
the size of the Linux kernel. 
Lee et al. [19] use data mining techniques to learn
intrusion detection models for adaptive intrusion detection.
Tongaonkar et al. [26] infer high-level security policy
from low-level firewall filtering rules. None of these
works infers access control related security rules.


8 Conclusions and Future Work 


This paper makes two contributions. One is to automat- 
ically infer code-level security rules and detect security 
violations. Our tool, AutoISES, automatically inferred 
84 security rules from the latest versions of Linux ker- 
nel and Xen, and used them to detect 8 security vulnera- 
bilities, demonstrating the effectiveness of our approach. 
The second contribution is to take the first step to quan- 
titatively study the impact of the rule granularity on rule 
generation and verification. This approach is orthogonal 
to our first contribution, and can be applied to other rule 
inference tools. 

While this work focuses on rule inference and viola- 
tion detection in Linux kernel and Xen, our techniques 
can be used to generate rules and detect violations in 
other access control systems. In addition, the techniques 
can be applied to infer general function-data correlation
rules, such as lock acquisition functions protecting
shared variable accesses. In the future, we plan to
improve our analysis and detection accuracy by employ- 
ing a more advanced static analysis tool and using finer 
rule granularity. 
Abstract

Despite having been around for more than 25 years, 
buffer overflow attacks are still a major security threat for 
deployed software. Existing techniques for buffer over- 
flow detection provide partial protection at best as they 
detect limited cases, suffer from many false positives, re- 
quire source code access, or introduce large performance 
overheads. Moreover, none of these techniques are easily 
applicable to the operating system kernel. 

This paper presents a practical security environment 
for buffer overflow detection in userspace and ker- 
nelspace code. Our techniques build upon dynamic in- 
formation flow tracking (DIFT) and prevent the attacker 
from overwriting pointers in the application or operat- 
ing system. Unlike previous work, our technique does 
not have false positives on unmodified binaries, protects 
both data and control pointers, and allows for practi- 
cal hardware support. Moreover, it is applicable to the 
kernel and provides robust detection of buffer overflows 
and user/kernel pointer dereferences. Using a full sys- 
tem prototype of a Linux workstation (hardware and soft- 
ware), we demonstrate our security approach in practice 
and discuss the major challenges for robust buffer over- 
flow protection in real-world software. 


1 Introduction 


Buffer overflows remain one of the most critical threats 
to systems security, although they have been prevalent 
for over 25 years. Successful exploitation of a buffer 
overflow attack often results in arbitrary code execu- 
tion, and complete control of the vulnerable application. 
Many of the most damaging worms and viruses [8, 27] 
use buffer overflow attacks. Kernel buffer overflows 
are especially potent as they can override any protection 
mechanisms, such as Solaris jails or SELinux access con- 
trols. Remotely exploitable buffer overflows have been 
found in modern operating systems including Linux [23], 
Windows XP and Vista [48], and OpenBSD [33]. 
Despite decades of research, the available buffer overflow
protection mechanisms are partial at best. These
mechanisms provide protection only in limited situa- 
tions [9], require source code access [51], cause false 
positives in real-world programs [28, 34], can be de- 
feated by brute force [44], or result in high runtime over- 
heads [29]. Additionally, there is no practical mechanism 
to protect the OS kernel from buffer overflows or unsafe 
user pointer dereferences. 

Recent research has established dynamic information 
flow tracking (DIFT) as a promising platform for detect- 
ing a wide range of security attacks on unmodified bina- 
ries. The idea behind DIFT is to tag (taint) untrusted data 
and track its propagation through the system. DIFT as- 
sociates a tag with every memory location in the system. 
Any new data derived from untrusted data is also tagged. 
If tainted data is used in a potentially unsafe manner, 
such as dereferencing a tagged pointer, a security excep- 
tion is raised. The generality of the DIFT model has led 
to the development of several software [31, 32, 38, 51] 
and hardware [5, 10, 13] implementations. 
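
The basic DIFT model described above can be sketched in a few lines: every value carries a tag, tags propagate to derived values, and dereferencing a tagged address raises an exception. This is a toy software model for illustration, not any of the cited implementations; the `Word` class, `add`, `load`, and the memory layout are hypothetical.

```python
class SecurityException(Exception):
    """Raised when tainted data is used in an unsafe way."""


class Word:
    """A machine word paired with a one-bit taint tag."""

    def __init__(self, value, tainted=False):
        self.value = value
        self.tainted = tainted


def add(a, b):
    """Data derived from tainted data is itself tainted."""
    return Word(a.value + b.value, a.tainted or b.tainted)


def load(memory, address):
    """Dereference an address, refusing tainted pointers."""
    if address.tainted:
        raise SecurityException("dereference of tainted pointer")
    return memory[address.value]


memory = {0x1000: Word(42)}
trusted_base = Word(0x1000)               # produced by the program itself
untrusted_offset = Word(0, tainted=True)  # e.g. read from the network

derived = add(trusted_base, untrusted_offset)
try:
    load(memory, derived)
    blocked = False
except SecurityException:
    blocked = True
```

Note that this naive rule blocks even a harmless offset of zero, because the derived address inherits the taint.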

Current DIFT systems use a security policy based on 
bounds-check recognition (BR) in order to detect buffer 
overflows. Under this scheme, tainted information must 
receive a bounds check before it can be safely deref- 
erenced as a pointer. While this technique has been 
used to defeat several exploits [5, 10, 10, 13], it suf- 
fers from many false positives and false negatives. In 
practice, bounds checks are ambiguously defined at best, 
and may be completely omitted in perfectly safe situ- 
ations [12, 13]. Thus, the applicability of a BR-based 
scheme is limited, rendering it hard to deploy. 

Recent work has proposed a new approach for pre- 
venting buffer overflows using DIFT [19]. This novel 
technique prevents pointer injection (PI) by the attacker. 
Most buffer overflow attacks are exploited by corrupt- 
ing and overwriting legitimate application pointers. This 
technique prevents such pointer corruption and does not 
rely on recognizing bounds checks, avoiding the false 
positives associated with BR-based analyses. However,
this work has never been applied to a large application, 
or the operating system kernel. Moreover, this technique 
requires hardware that is extremely complex and imprac- 
tical to build. 

This paper presents a practical approach for prevent- 
ing buffer overflows in userspace and kernelspace using 
a pointer injection-based DIFT analysis. Our approach 
identifies and tracks all legitimate pointers in the appli- 
cation. Untrusted input must be combined with a legiti- 
mate pointer before being dereferenced. Failure to do so 
will result in a security exception. 
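
The rule that untrusted input must be combined with a legitimate pointer can be sketched with two tag bits per value. The sketch assumes a taint bit and a pointer bit, with the pointer bit propagating through address arithmetic; this is an assumed reading of the policy for illustration, not the hardware design. The class `Tagged`, the helpers, and the addresses are hypothetical.

```python
class SecurityException(Exception):
    """Raised when an injected pointer is dereferenced."""


class Tagged:
    """A value with a taint bit and a legitimate-pointer bit."""

    def __init__(self, value, tainted=False, pointer=False):
        self.value = value
        self.tainted = tainted
        self.pointer = pointer


def add(a, b):
    """Address arithmetic: both bits propagate from either operand."""
    return Tagged(
        a.value + b.value,
        tainted=a.tainted or b.tainted,
        pointer=a.pointer or b.pointer,
    )


def dereference(memory, address):
    """Allow tainted addresses only if derived from a legitimate pointer."""
    if address.tainted and not address.pointer:
        raise SecurityException("tainted address with no legitimate pointer")
    return memory[address.value]


memory = {0x2000: "a", 0x2001: "b", 0x9000: "secret"}
table = Tagged(0x2000, pointer=True)   # legitimate pointer in the program
index = Tagged(1, tainted=True)        # untrusted input used as an index
injected = Tagged(0x9000, tainted=True)  # raw address supplied by the attacker

indexed_value = dereference(memory, add(table, index))
try:
    dereference(memory, injected)
    injection_blocked = False
except SecurityException:
    injection_blocked = True
```

The tainted index is accepted because it is combined with `table`, while the attacker-supplied address is rejected.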

The specific contributions of this work are: 


• We present the first DIFT policy for buffer overflow
prevention that runs on stripped, unmodified binaries,
protects both code and data pointers, and runs
on real-world applications such as GCC and Apache
without false positives.


• We demonstrate that the same policy is applicable to
the Linux kernel. It is the first security policy to
dynamically protect the kernel code from buffer overflows
and user-kernel pointer dereferences without
introducing false positives.


• We use a full-system DIFT prototype based on the
SPARC V8 processor to demonstrate the integration
of hardware and software techniques for robust
protection against buffer overflows. Our results are
evaluated on a Gentoo Linux platform. We show
that hardware support requirements are reasonable
and that the performance overhead is minimal.


• We discuss practical shortcomings of our approach,
and discuss how flaws can be mitigated using additional
security policies based on DIFT.


The remainder of the paper is organized as follows. 
Section 2 reviews related work. Section 3 summarizes 
the Raksha architecture and our full-system prototype. 
Section 4 presents our policy for buffer overflow protec- 
tion for userspace applications, while Section 5 extends 
the protection to the operating system kernel. Section 6 
discusses weaknesses in our approach, and how they can 
be mitigated with other buffer overflow prevention poli- 
cies. Finally, Section 7 concludes the paper. 


2 Related Work 


Buffer overflow prevention is an active area of research 
with decades of history. This section summarizes the 
state of the art in buffer overflow prevention and the 
shortcomings of currently available approaches. 
2.1 Existing Buffer Overflow Solutions


Many solutions have been proposed to prevent pointer or 
code corruption by untrusted data. Unfortunately, the so- 
lutions deployed in existing systems have drawbacks that 
prevent them from providing comprehensive protection 
against buffer overflows. 


Canary-based buffer overflow protection uses a ran- 
dom, word-sized canary value to detect overwrites of 
protected data. Canaries are placed before the begin- 
ning of protected data and the value of the canary is ver- 
ified each time protected data is used. A standard buffer 
overflow attack will change the canary value before over- 
writing protected data, and thus canary checks provide 
buffer overflow detection. Software implementations of 
canaries typically require source code access and have 
been used to protect stack linking information [16] and 
heap chunk metadata [39]. Related hardware canary im- 
plementations [21, 46] have been proposed to protect the 
stack return address and work with unmodified binaries. 


However, buffer overflows may be exploited with- 
out overwriting canary values in many situations. For 
example, in a system with stack canaries, buffer over- 
flows may overwrite local variables, even function point- 
ers, because the stack canary only protects stack link- 
ing information. Similarly, heap overflows can overwrite 
neighboring variables in the same heap chunk without 
overwriting canaries. Additionally, this technique may 
change data structure layout by inserting canary words, 
breaking compatibility with legacy applications. Canary- 
based approaches also do not protect other memory re- 
gions such as the global data segment, BSS or custom 
heap allocation arenas. 
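The limitation described above can be illustrated with a small model. The sketch below is not taken from any canary implementation; the frame layout, the field names, and the fixed canary constant are assumptions made purely for illustration (a real canary is random). It shows an overflow that corrupts a local function pointer while leaving the canary and the linking information intact.

```c
#include <stdint.h>
#include <string.h>

#define CANARY_VALUE 0xC0FFEE42u  /* illustrative constant; real canaries are random */

/* Hypothetical frame model: fields are listed in increasing address order. */
struct frame {
    uint8_t  buf[8];    /* attacker-writable buffer */
    uint32_t handler;   /* stands in for a local function pointer */
    uint32_t canary;    /* sits just before the protected linking data */
    uint32_t ret_addr;  /* stack linking information */
};

void frame_init(struct frame *f)
{
    memset(f->buf, 0, sizeof f->buf);
    f->handler  = 0x2000u;
    f->canary   = CANARY_VALUE;
    f->ret_addr = 0x1000u;
}

/* Unchecked copy that starts at buf and runs upward through the frame. */
void overflow(struct frame *f, const uint8_t *src, size_t len)
{
    if (len > sizeof *f)
        len = sizeof *f;   /* keep the model itself memory-safe */
    memcpy(f, src, len);
}

/* The check a canary scheme performs before using the linking data. */
int canary_intact(const struct frame *f)
{
    return f->canary == CANARY_VALUE;
}
```

A 12-byte write in this model replaces handler yet passes the canary check; only a write that reaches past offset 12 is detected.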

Non-executable data protection prevents stack or 
heap data from being executed as code. Modern hard- 
ware platforms, including the x86, support this technique 
by enforcing executable permissions on a per-page basis. 
However, this approach breaks backwards compatibility 
with legacy applications that generate code at runtime 
on the heap or stack. More importantly, this approach 
only prevents buffer overflow exploits that rely on code 
injection. Rather than injecting new code, attackers can 
take control of an application by using existing code in 
the application or libraries. This form of attack, known 
as a return-into-libc exploit, can perform arbitrary com- 
putations and in practice is just as powerful as a code 
injection attack [43]. 

Address space layout randomization (ASLR) is a 
buffer overflow defense that randomizes the memory lo- 
cations of system components [34]. In a system with 
ASLR, the base address of each memory region (stack, 
executable, libraries, heap) is randomized at startup. A 
standard buffer overflow attack will not work reliably, 
as the security-critical information is not easy to locate
in memory. ASLR has been adopted on both Linux
and Windows platforms. However, ASLR is not back- 
wards compatible with legacy code, as it requires pro- 
grams to be recompiled into position-independent exe- 
cutables [49] and will break code that makes assump- 
tions about memory layout. ASLR must be disabled for 
the entire process if it is not supported by the executable 
or any shared libraries. Real-world exploits such as the 
Macromedia Flash buffer overflow attack [14] on Win- 
dows Vista have trivially bypassed ASLR because the 
vulnerable application or its third-party libraries did not 
have ASLR support. 

Moreover, attackers can easily circumvent ASLR on 
32-bit systems using brute-force techniques [44]. On 
little-endian architectures such as the x86, partial over- 
write attacks on the least significant bytes of a pointer 
have been used to bypass ASLR protection [3, 17]. Ad- 
ditionally, ASLR implementations can be compromised 
if pointer values are leaked to the attacker by techniques 
such as format string attacks [3]. 
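To see why a partial overwrite sidesteps randomization, consider the following sketch. It is a simplified model, not an exploit: the function name is hypothetical, and it assumes that randomization leaves the two least significant bytes of the pointer unchanged. On a little-endian machine the least significant bytes of a pointer sit at the lowest addresses, so an overflow reaches them first.

```c
#include <stddef.h>
#include <stdint.h>

/* Overwrite the first n bytes (in memory order) of a 32-bit pointer that is
 * stored little-endian, and return the resulting pointer value. */
uint32_t partial_overwrite_le(uint32_t ptr, const uint8_t *attacker, size_t n)
{
    uint8_t mem[4];
    uint32_t out = 0;
    size_t i;

    for (i = 0; i < 4; i++)
        mem[i] = (uint8_t)(ptr >> (8 * i));   /* little-endian layout */
    if (n > 4)
        n = 4;
    for (i = 0; i < n; i++)
        mem[i] = attacker[i];                 /* overflow hits low bytes first */
    for (i = 0; i < 4; i++)
        out |= (uint32_t)mem[i] << (8 * i);
    return out;
}
```

Whatever the randomized upper bytes are, a two-byte overwrite redirects the pointer to the same offset within the same 64 KB region, so the attacker never needs to learn the base address.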

Overall, while existing defense mechanisms have 
raised the bar, buffer overflow attacks remain a problem. 
Real-world exploits such as [14] and [17] demonstrated 
that a seasoned attacker can bypass even the combination 
of ASLR, stack canaries, and non-executable pages. 


2.2 Dynamic Information Flow Tracking 


Dynamic Information Flow Tracking (DIFT) is a practi- 
cal platform for preventing a wide range of security at- 
tacks from memory corruptions to SQL injections. DIFT 
associates a tag with every memory word or byte. The 
tag is used to taint data from untrusted sources. Most op- 
erations propagate tags from source operands to destina- 
tion operands. If tagged data is used in unsafe ways, such 
as dereferencing a tainted pointer or executing a tainted 
SQL command, a security exception is raised. 

DIFT has several advantages as a security mechanism. 
DIFT analyses can be applied to unmodified binaries. 
Using hardware support, DIFT has negligible overhead 
and works correctly with all types of legacy applica- 
tions, even those with multithreading and self-modifying 
code [7, 13]. DIFT can potentially provide a solution 
to the buffer overflow problem that protects all pointers 
(code and data), has no false positives, requires no source 
code access, and works with unmodified legacy binaries 
and even the operating system. Previous hardware ap- 
proaches protect only the stack return address [21, 46] or 
prevent code injection with non-executable pages. 

There are two major policies for buffer overflow pro- 
tection using DIFT: bounds-check recognition (BR) and 
pointer injection (PI). The approaches differ in tag prop- 
agation rules, the conditions that indicate an attack, and 
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whether tagged input can ever be validated by applica- 
tion code. 

Most DIFT systems use a BR policy to prevent buffer 
overflow attacks [5, 10, 13, 38]. This technique forbids 
dereferences of untrusted information without a preced- 
ing bounds check. A buffer overflow is detected when a 
tagged code or data pointer is used. Certain instructions, 
such as logical AND and comparison against constants, 
are assumed to be bounds check operations that repre- 
sent validation of untrusted input by the program code. 
Hence, these instructions untaint any tainted operands. 

Unfortunately, the BR policy leads to significant false 
negatives [13, 19]. Not all comparisons are bounds 
checks. For example, the glibc strtok() function compares
each input character against a class of allowed
characters, and stores matches in an output buffer. DIFT 
interprets these comparisons as bounds checks, and thus 
the output buffer is always untainted, even if the input 
to strtok() was tainted. This can lead to false negatives 
such as failure to detect a malicious return address over- 
write in the atphttpd stack overflow [1]. 
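The strtok() case can be modeled in a few lines. The sketch below is a simplified stand-in, not the glibc code: the types and function names are hypothetical, and the BR rule is reduced to "a comparison against a constant untaints its operand". Each input byte is loaded into a register, compared, and stored, so the stored copy has lost its taint.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct tbyte { uint8_t v; bool tainted; };

/* BR rule (simplified): comparing against a constant untaints the operand. */
bool br_cmp_const(struct tbyte *reg, uint8_t c)
{
    reg->tainted = false;
    return reg->v == c;
}

/* Copy every input byte that matches one of the allowed characters. */
size_t copy_allowed(const struct tbyte *in, size_t n,
                    const char *allowed, struct tbyte *out)
{
    size_t k = 0;

    for (size_t i = 0; i < n; i++) {
        struct tbyte reg = in[i];                 /* load into a register */
        for (const char *a = allowed; *a != '\0'; a++) {
            if (br_cmp_const(&reg, (uint8_t)*a)) {
                out[k++] = reg;                   /* store: taint is gone */
                break;
            }
        }
    }
    return k;
}
```

The output bytes are attacker-controlled but untainted, which is exactly the false negative described above.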

However, the most critical flaw of BR-based policies 
is an unacceptable number of false positives with com- 
monly used software. Any scheme for input validation 
on binaries has an inherent false positive risk. While the 
tainted value that is bounds checked is untainted, none 
of the aliases for that value in memory or other registers 
will be validated. Moreover, even trivial programs can 
cause false positives because not all untrusted pointer 
dereferences need to be bounds checked [13]. Many 
common glibc functions, such as tolower(), toupper(), 
and various character classification functions (isalpha(), 
isalnum(), etc.) index an untrusted byte into a 256 entry 
table. This is completely safe, and requires no bounds 
check. However, BR policies fail to recognize this in- 
put validation case because the bounds of the table are 
not known in a stripped binary. Hence, false positives 
occur during common system operations such as com- 
piling files with gcc and compressing data with gzip. 
In practice, false positives occur only for data pointer 
protection. No false positive has been reported on x86 
Linux systems so long as only control pointers are pro- 
tected [10]. Unfortunately, control pointer protection 
alone has been shown to be insufficient [6]. 
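The table-lookup false positive can be sketched the same way. The model below is an assumption-laden simplification with hypothetical names: it treats the BR check as "any tainted load address raises an exception" and models a tolower()-style lookup as a table base plus an untrusted byte, with no comparison or AND in between.

```c
#include <stdbool.h>
#include <stdint.h>

struct tbyte { uint8_t v; bool tainted; };
struct tword { uint32_t v; bool tainted; };

/* BR-style load check (simplified): a tainted address raises an exception. */
bool br_load_raises(struct tword addr)
{
    return addr.tainted;
}

/* tolower()-style lookup: the index is a byte, so it is always below 256
 * and the access is safe, but no bounds check instruction ever runs. */
bool br_lookup_raises(uint32_t table_base, struct tbyte ch)
{
    struct tword addr;

    addr.v = table_base + ch.v;   /* add propagates taint from the index */
    addr.tainted = ch.tainted;
    return br_load_raises(addr);
}
```

Every tainted character therefore triggers an exception even though the access can never leave the 256-entry table.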

Recent work [19] has proposed a pointer injection (PI) 
policy for buffer overflow protection using DIFT. Rather 
than recognize bounds checks, PI enforces a different in- 
variant: untrusted information should never directly sup- 
ply a pointer value. Instead, tainted information must 
always be combined with a legitimate pointer from the 
application before it can be dereferenced. Applications 
frequently add an untrusted index to a legitimate base 
address pointer from the application’s address space. On 
the other hand, existing exploitation techniques rely on
injecting pointer values directly, such as by overwriting
the return address, frame pointers, global offset table en- 
tries, or malloc chunk header pointers. 

To prevent buffer overflows, a PI policy uses two tag 
bits per memory location: one to identify tainted data (T 
bit) and the other to identify pointers (P bit). As in other 
DIFT analyses, the taint bit is set for all untrusted in- 
formation, and propagated during data movement, arith- 
metic, and logical instructions. However, PI provides 
no method for untainting data, nor does it rely on any 
bounds check recognition. The P bit is set only for legiti- 
mate pointers in the application and propagated only dur- 
ing valid pointer operations such as adding a pointer to a 
non-pointer or aligning a pointer to a power-of-2 bound- 
ary. Security attacks are detected if a tainted pointer is 
dereferenced and the P bit is not set. The primary ad- 
vantage of PI is that it does not rely on bounds check 
recognition, thus avoiding the false positive and negative 
issues that plagued the BR-based policies. 
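A minimal sketch of this two-bit scheme follows. It models only the add rule described in this paragraph (a pointer added to a non-pointer yields a pointer) and the dereference check; the type and function names are hypothetical and are not taken from [19].

```c
#include <stdbool.h>

struct ptag { bool t; bool p; };   /* T = tainted, P = legitimate pointer */

/* Add: taint always propagates; the result is a pointer when exactly one
 * operand is a pointer, as in "adding a pointer to a non-pointer". */
struct ptag pi_add(struct ptag a, struct ptag b)
{
    struct ptag d;

    d.t = a.t || b.t;
    d.p = (a.p != b.p);
    return d;
}

/* Dereference check: tainted and not derived from a legitimate pointer. */
bool pi_deref_raises(struct ptag addr)
{
    return addr.t && !addr.p;
}
```

An untrusted index added to a legitimate base produces a tainted value that still carries the P bit and may be dereferenced, whereas a pointer value supplied directly by the input is tainted with no P bit and is rejected.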

The disadvantage of the PI policy is that it requires 
legitimate application pointers to be identified. For dy- 
namically allocated memory, this can be accomplished 
by setting the P bit of any pointer returned by a memory- 
allocating system call such as mmap or brk. However, 
no such solution has been presented for pointers to stat- 
ically allocated memory regions. The original proposal 
requires that each add or sub instruction determines if 
one of its untainted operands points into any valid virtual 
address range [19]. If so, the destination operand has its
P bit set, even if the source operand does not. To support 
such functionality, the hardware would need to traverse 
the entire page table or some other variable length data- 
structure that summarizes the allocated portions of the 
virtual address space for every add or subtract instruction 
in the program. The complexity and runtime overhead of 
such hardware is far beyond what is acceptable in mod- 
ern systems. Furthermore, while promising, the PI policy 
has not been evaluated on a wide range of large applica- 
tions, as the original proposal was limited to simulation 
studies with performance benchmarks. 

DIFT has never been used to provide buffer overflow 
protection for the operating system code itself. The OS 
code is as vulnerable to buffer overflows as user code, 
and several such attacks have been documented [23, 33, 
48]. Moreover, the complexity of the OS code represents 
a good benchmark for the robustness of a security policy, 
especially with respect to false positives. 


3 DIFT System Overview 


Our experiments are based on Raksha, a full-system pro- 
totype with hardware support for DIFT [13]. Hardware- 
assisted DIFT provides a number of advantages over 
software approaches. Software DIFT relies on dynamic
binary translation and incurs significant overheads ranging
from 3x to 37x [31, 38]. Software DIFT does not
work with self-modifying code and leads to races for
multithreaded code that result in false positives and negatives
[7]. Hardware support addresses these shortcomings,
and allows us to apply DIFT analysis to the operating
system code as well.


Figure 1: The system stack for the Raksha DIFT platform.


3.1 The Raksha Architecture 


Raksha is a DIFT platform that includes hardware and 
software components. It was the first DIFT system to 
prevent both high-level attacks such as SQL injection 
and cross-site scripting, and lower-level attacks such as 
format strings and buffer overflows on unmodified bina- 
ries [13]. Prior to this work, Raksha only supported a 
BR-based policy for buffer overflow protection, and en- 
countered the associated false positives and negatives. 

Raksha extends each register and memory word by 
four tag bits in hardware. Each bit supports an inde- 
pendent security policy specified by software using a set 
of policy configuration registers that define the rules for 
propagating and checking the tag bits for untrusted data. 
Tags and configuration registers are completely transpar- 
ent to applications, which are unmodified binaries. 

Tag operations: Hardware is extended to perform tag 
propagation and checks in addition to the functionality 
defined by each instruction. All instructions in the in- 
struction set are decomposed into one or more primitive 
operations such as arithmetic, logical, etc. Check and 
propagate rules are specified by software at the gran- 
ularity of primitive operations. This allows the secu- 
rity policy configuration to be independent of instruction 
set complexity (CISC vs RISC), as all instructions are 
viewed as a sequence of one or more primitive opera- 
tions. For example, the subtract-and-compare instruction 
in the SPARC architecture is decomposed into an arith- 
metic operation and a comparison operation. Hardware 
first performs tag propagation and checks for the arithmetic
operation, followed by propagation and checks for
the comparison operation. In addition, Raksha allows
software to specify custom rules for a small number of
individual instructions. This enables handling corner
cases within a primitive operation class. For example,
“xor r1,r1,r1” is a commonly used idiom to reset registers,
especially on x86 machines. Software can indicate
that such an instruction untaints its output operand.

The original Raksha hardware supported AND and 
OR propagation modes when the tag information of two 
input operands is combined to generate the tag for the 
output operand. For this work, we found it necessary to 
support a logical XOR mode for certain operations that 
clear the output tag if both tags are set. 
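The three ways of combining two one-bit source tags can be written down directly. The enum and function names below are hypothetical and do not correspond to Raksha's configuration register encoding; the sketch only captures the Boolean behavior of each mode.

```c
enum prop_mode { PROP_AND, PROP_OR, PROP_XOR };

/* Combine two one-bit source tags into the destination tag. */
unsigned combine_tag(enum prop_mode mode, unsigned a, unsigned b)
{
    a &= 1u;
    b &= 1u;
    switch (mode) {
    case PROP_AND: return a & b;   /* set only if both sources are set */
    case PROP_OR:  return a | b;   /* set if either source is set */
    case PROP_XOR: return a ^ b;   /* set if exactly one source is set */
    }
    return 0u;
}
```

XOR differs from OR only when both source tags are set, in which case the output tag is cleared.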

The Raksha hardware implements tags at word granu- 
larity. To handle byte or halfword updates, the propaga- 
tion rules can specify how to merge the new tag from the 
partial update with the existing tag for the whole word. 
Software can also be used to maintain accurate tags at 
byte granularity, building upon the low overhead excep- 
tion mechanism listed below. Nevertheless, this capabil- 
ity was not necessary for this work, as we focus on pro- 
tecting pointers which are required to be aligned at word 
boundaries by modern executable file formats [41]. 

Security Exceptions: Failing tag checks result in se- 
curity exceptions. These exceptions are implemented as 
user-level exceptions and incur overhead similar to that 
of a function call. As security exceptions do not require 
a change in privilege level, the security policies can also 
be applied to the operating system. A special trusted 
mode provides the security exception handler with di- 
rect access to tag bits and configuration registers. All 
code outside the handler (application or OS code) runs 
in untrusted mode, and may not access tags or config- 
uration registers. We prevent untrusted code from ac- 
cessing, modifying, or executing the handler code or data 
by using one of the four available tag bits to implement 
a sandboxing policy that prevents loads and stores from 
untrusted code to reference monitor memory [13]. This 
ensures handler integrity even during a memory corrup- 
tion attack on the application. 

At the software level, Raksha introduces a security 
monitor module. The monitor is responsible for setting 
the hardware configuration registers for check and prop- 
agate rules based on the active security policies in the 
system. It also includes the handler that is invoked on 
security exceptions. While in some cases a security ex- 
ception leads to immediate program termination, in other 
cases the monitor invokes additional software modules 
for further processing of the security issues. For exam- 
ple, SQL injection protection raises a security exception 
on every database query operation so that the security 
monitor may inspect the current SQL query and verify 
that it does not contain a tainted SQL command.


3.2 System Prototype


Figure 1 provides an overview of the Raksha system
along with the changes made to hardware and software 
components. The hardware is based on the Leon SPARC 
V8 processor, a 32-bit open-source synthesizable core 
developed by Gaisler Research [22]. We modified Leon 
to include the security features of Raksha and mapped 
it to a Virtex-II Pro FPGA. Leon uses a single-issue, 7- 
stage pipeline with first-level caches. Its RTL code was 
modified to add 4-bit tags to all user-visible registers, and 
cache and memory locations. In addition, Raksha’s con- 
figuration and exception registers, as well as instructions 
that directly access tags and manipulate the special reg- 
isters, were added to the Leon. Overall, we added 9 in- 
structions and 16 registers to the SPARC V8 ISA. 

The resulting system is a full-featured SPARC Linux 
workstation running Gentoo Linux with a 2.6 kernel. 
DIFT policies are applied to all userspace applications, 
which are unmodified binaries with no source code ac- 
cess or debugging information. The security framework 
is extensible through software, can track information 
flow across address spaces, and can thwart attacks em- 
ploying multiple processes. Since tag propagation and 
checks occur in hardware and are parallel with instruc- 
tion execution, Raksha has minimal impact on the ob- 
served performance [13]. 

Although the following sections will discuss primarily 
the hardware and software issues we observed with the 
SPARC-based prototype, we also comment on the addi- 
tional issues, differences, and solutions for other archi- 
tectures such as the x86. 


4 BOF Protection for Userspace 


To provide comprehensive protection against buffer over- 
flows for userspace applications, we use DIFT with 
a pointer injection (PI) policy. In contrast to previ- 
ous work [19], our PI policy has no false positives on 
large Unix applications, provides reliable identification 
of pointers to statically allocated memory, and requires 
simple hardware support well within the capabilities of 
proposed DIFT architectures such as Raksha. 


4.1 Rules for DIFT Propagation & Checks 


Tables 1 and 2 present the DIFT rules for tag propaga- 
tion and checks for buffer overflow prevention. The rules 
are intended to be as conservative as possible while still 
avoiding false positives. Since our policy is based on 
pointer injection, we use two tag bits per word of mem- 
ory and hardware register. The taint (T) bit is set for 
untrusted data, and propagates on all arithmetic, logical, 
and data movement instructions. Any instruction with
a tainted source operand propagates taint to the destination
operand (register or memory). The pointer (P) bit is
initialized for legitimate application pointers and propagates
during valid pointer operations such as pointer
arithmetic. A security exception is thrown if a tainted instruction
is fetched or if the address used in a load, store,
or jump instruction is tainted and not a valid pointer.
In other words, we allow a program to combine a valid
pointer with an untrusted index, but not to use an untrusted
pointer directly.


Operation        | Example          | Meaning               | Taint Propagation       | Pointer Propagation
Load             | ld r1+imm, r2    | r2 = M[r1+imm]        | T[r2] = T[M[r1+imm]]    | P[r2] = P[M[r1+imm]]
Store            | st r2, r1+imm    | M[r1+imm] = r2        | T[M[r1+imm]] = T[r2]    | P[M[r1+imm]] = P[r2]
Add/Subtract/Or  | add r1, r2, r3   | r3 = r1 + r2          | T[r3] = T[r1] ∨ T[r2]   | P[r3] = P[r1] ∨ P[r2]
And              | and r1, r2, r3   | r3 = r1 ∧ r2          | T[r3] = T[r1] ∨ T[r2]   | P[r3] = P[r1] ⊕ P[r2]
All other ALU    | xor r1, r2, r3   | r3 = r1 ⊕ r2          | T[r3] = T[r2] ∨ T[r1]   | P[r3] = 0
Sethi            | sethi imm, r1    | r1 = imm              | T[r1] = 0               | P[r1] = P[insn]
Jump             | jmpl r1+imm, r2  | r2 = pc; pc = r1+imm  | T[r2] = 0               | P[r2] = 1

Table 1: The DIFT propagation rules for the taint and pointer bit. T[x] and P[x] refer to the taint (T) or pointer (P) tag bits respectively for memory
location, register, or instruction x.


Our propagation rules for the P bit (Table 1) are de- 
rived from pointer operations used in real code. Any 
operation that could reasonably result in a valid pointer 
should propagate the P bit. For example, we propagate 
the P bit for data movement instructions such as load 
and store, since copying a pointer should copy the P 
bit as well. The and instruction is often used to align 
pointers. To model this behavior, the and propagation 
rule sets the P bit of the destination register if one source
operand is a pointer, and the other is a non-pointer. Sec- 
tion 4.5 discusses a more conservative and propagation 
policy that results in runtime performance overhead. 


The P bit propagation rule for addition and subtrac- 
tion instructions is more permissive than the policy used 
in [19], due to false positives encountered in legitimate 
code of several applications. We propagate the P bit if 
either operand is a pointer as we encountered real-world 
situations where two pointers are added together. For 
example, the glibc function _itoa_word() is used to 
convert integers to strings. When given a pointer argu- 
ment, it indexes bits of the pointer into an array of dec- 
imal characters on SPARC systems, effectively adding 
two pointers together. 
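The propagation rules discussed so far can be expressed as one small function. This is a software sketch of the rules for data movement, add/subtract/or, and, and the remaining ALU operations; the enum and struct names are hypothetical, and the real system applies these rules in hardware through configuration registers rather than in code like this.

```c
#include <stdbool.h>

struct tag { bool t; bool p; };

enum op_class { OP_MOVE, OP_ADD_SUB_OR, OP_AND, OP_OTHER_ALU };

/* Destination tag for an operation with source tags a and b.
 * For OP_MOVE (load/store) only a is used. */
struct tag propagate(enum op_class op, struct tag a, struct tag b)
{
    struct tag d;

    switch (op) {
    case OP_MOVE:
        d = a;                  /* copying a pointer copies both bits */
        break;
    case OP_ADD_SUB_OR:
        d.t = a.t || b.t;
        d.p = a.p || b.p;       /* permissive: pointer + pointer stays a pointer */
        break;
    case OP_AND:
        d.t = a.t || b.t;
        d.p = (a.p != b.p);     /* alignment: pointer AND non-pointer mask */
        break;
    default:                    /* multiply, shift, xor, ... */
        d.t = a.t || b.t;
        d.p = false;
        break;
    }
    return d;
}
```

Under these rules, adding two pointers (as in the _itoa_word() case) still yields a pointer, while and-ing two pointers or multiplying a pointer clears the P bit.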


Moreover, we have found that the call and jmpl
instructions, which read the program counter (PC) into
a register, must always set the P bit of their destination
register. This is because assembly routines such as
glibc memcpy() on SPARC contain optimized versions
of Duff’s device that use the PC as a pointer [15]. In
memcpy(), a call instruction reads PC into a register
and adds to it the (possibly tainted) copy length argument.
The resulting value is used to jump into the middle
of a large block of copy statements. Unless the call
and jmpl set the destination P bit, this behavior would
cause a false positive. Similar logic can be found in the
memcmp() function in glibc for x86 systems.


Operation          | Example          | Security Check
Load               | ld r1+imm, r2    | T[r1] ∧ ¬P[r1]
Store              | st r2, r1+imm    | T[r1] ∧ ¬P[r1]
Jump               | jmpl r1+imm, r2  | T[r1] ∧ ¬P[r1]
Instruction fetch  | -                | T[insn]

Table 2: The DIFT check rules for BOF detection. rx means register x.
A security exception is raised if the condition in the rightmost column
is true.
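The interaction between the check rules and the PC-reading instructions can be sketched as follows. The helper names are hypothetical; the code models only the tag bits involved in the memcpy() scenario, not the instruction semantics.

```c
#include <stdbool.h>

struct tag { bool t; bool p; };

/* Add rule: both bits propagate if either source has them set. */
struct tag tag_add(struct tag a, struct tag b)
{
    struct tag d;

    d.t = a.t || b.t;
    d.p = a.p || b.p;
    return d;
}

/* Check applied to the address register of a load, store, or jump. */
bool addr_check_raises(struct tag r1)
{
    return r1.t && !r1.p;
}

/* Check applied to every fetched instruction word. */
bool fetch_check_raises(bool insn_tainted)
{
    return insn_tainted;
}

/* Tag written to the destination of call/jmpl. The set_p parameter exists
 * only to contrast the two possible design choices. */
struct tag tag_read_pc(bool set_p)
{
    struct tag d;

    d.t = false;
    d.p = set_p;
    return d;
}
```

With set_p true, PC plus a tainted copy length is tainted but carries the P bit, so the computed jump passes the check; with set_p false, the same legitimate code would raise a security exception.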

Finally, we must propagate the P bit for instructions 
that may initialize a pointer to a valid address in statically 
allocated memory. The only instruction used to initialize 
a pointer to statically allocated memory is sethi. The 
sethi instruction sets the most significant 22 bits of a 
register to the value of its immediate operand and clears 
the least significant 10 bits. If the analysis described in 
Section 4.2.2 determines that a sethi instruction is a 
pointer initialization statement, then the P bit for this in- 
struction is set at process startup. We propagate the P 
bit of the sethi instruction to its destination register at
runtime. A subsequent or instruction may be used to ini- 
tialize the least significant 10 bits of a pointer, and thus 
must also propagate the P bit of its source operands. 
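As an illustration, the sethi/or sequence and its tag behavior can be modeled like this. The struct and function names are hypothetical; the sketch assumes the or takes an untagged immediate, so the destination simply inherits the tags of the register operand.

```c
#include <stdbool.h>
#include <stdint.h>

struct reg { uint32_t val; bool t; bool p; };

/* sethi imm22, rd: upper 22 bits from the immediate, lower 10 bits cleared.
 * T is cleared; P is copied from the P bit attached to the instruction. */
struct reg exec_sethi(uint32_t imm22, bool insn_p)
{
    struct reg rd;

    rd.val = (imm22 & 0x3FFFFFu) << 10;
    rd.t = false;
    rd.p = insn_p;
    return rd;
}

/* or rs, imm, rd with an untagged immediate: fills in the low 10 bits
 * and carries the source register's tags to the destination. */
struct reg exec_or_imm(struct reg rs, uint32_t low10)
{
    struct reg rd = rs;

    rd.val = rs.val | (low10 & 0x3FFu);
    return rd;
}
```

For the address 0x12345, the upper 22 bits are 0x48 and the lower 10 bits are 0x345, so the pair reconstructs the full address while the P bit set on the sethi instruction at startup reaches the final register.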

The remaining ALU operations such as multiply or 
shift should not be performed on pointers. These op- 
erations clear the P bit of their destination operand. If a 
program marshals or encodes pointers in some way, such 
as when migrating shared state to another process [36], 
a more liberal pointer propagation ruleset similar to our
taint propagation rules may be necessary.


4.2 Pointer Identification 


The PI-based policy depends on accurate identification 
of legitimate pointers in the application code in order to 
initialize the P bit for these memory locations. When 
a pointer is assigned a value derived from an existing
pointer, tag propagation will ensure that the P bit is set
appropriately. The P bit must only be initialized for root 
pointer assignments, where a pointer is set to a valid 
memory address that is not derived from another pointer. 
We distinguish between static root pointer assignments, 
which initialize a pointer with a valid address in stati- 
cally allocated memory (such as the address of a global 
variable), and dynamic root pointer assignments, which 
initialize a pointer with a valid address in dynamically 
allocated memory. 


4.2.1 Pointers to Dynamically Allocated Memory 


To allocate memory at runtime, user code must use a 
system call. On a Linux SPARC system, there are five 
memory allocation system calls: mmap, mmap2, brk, 
mremap, and shmat. All pointers to dynamically allo- 
cated memory are derived from the return values of these 
system calls. We modified the Linux kernel to set the P 
bit of the return value for any successful memory alloca- 
tion system call. This allows all dynamic root pointer as- 
signments to be identified without false positives or neg- 
atives. Furthermore, we also set the P bit of the stack 
pointer register at process startup. 
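The decision made on the system call return path can be sketched as below. This is not the actual kernel patch: the enum values are hypothetical identifiers rather than real SPARC system call numbers, and the error test assumes the common Linux in-kernel convention that a return value in the range [-4095, -1] encodes an errno.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical identifiers, not real system call numbers. */
enum sys_id {
    SYS_ID_MMAP, SYS_ID_MMAP2, SYS_ID_BRK, SYS_ID_MREMAP, SYS_ID_SHMAT,
    SYS_ID_READ   /* example of a non-allocating call */
};

bool is_alloc_syscall(enum sys_id id)
{
    switch (id) {
    case SYS_ID_MMAP:
    case SYS_ID_MMAP2:
    case SYS_ID_BRK:
    case SYS_ID_MREMAP:
    case SYS_ID_SHMAT:
        return true;
    default:
        return false;
    }
}

/* Assumed convention: 32-bit values in [-4095, -1] are error codes. */
bool is_error_value(uint32_t ret)
{
    return ret >= 0xFFFFF001u;
}

/* Should the P bit of the return value register be set? */
bool ret_gets_p_bit(enum sys_id id, uint32_t ret)
{
    return is_alloc_syscall(id) && !is_error_value(ret);
}
```

A real implementation would need a per-call notion of success; brk, for example, reports failure differently from mmap, which this sketch does not model.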


4.2.2 Pointers to Statically Allocated Memory 


All static root pointer assignments are contained in the 
data and code sections of an object file. The data section 
contains pointers initialized to statically allocated mem- 
ory addresses. The code section contains instructions 
used to initialize pointers to statically allocated memory 
at runtime. To initialize the P bit for static root pointer 
assignments, we must scan all data and code segments of 
the executable and any shared libraries at startup. 

When the program source code is compiled to a re- 
locatable object file, all references to statically allocated 
memory are placed in the relocation table. Each relo- 
cation table entry stores the location of the memory ref- 
erence, the reference type, the symbol referred to, and 
an optional symbol offset. For example, a pointer in the 
data segment initialized to &x + 4 would have a reloca- 
tion entry with type data, symbol x, and offset 4. When 
the linker creates a final executable or library image from 
a group of object files, it traverses the relocation table in 
each object file and updates a reference to statically al- 
located memory if the symbol it refers to has been relo- 
cated to a new address. 

With access to full relocation tables, static root pointer 
assignments can be identified without false positives or 
negatives. Conceptually, we set the P bit for each in- 
struction or data word whose relocation table entry is 
a reference to a symbol in statically allocated memory. 
However, in practice full relocation tables are not available
in executables or shared libraries. Hence, we must
conservatively identify statically allocated memory ref- 
erences without access to relocation tables. Fortunately, 
the restrictions placed on references to statically allo- 
cated memory by the object file format allow us to detect 
such references by scanning the code and data segments, 
even without a relocation table. The only instructions 
or data that can refer to statically allocated memory are 
those that conform to an existing relocation entry format. 


Like all modern Unix systems, our prototype uses the 
ELF object file format [41]. Statically allocated memory 
references in data segments are 32-bit constants that are 
relocated using the R_SPARC_32 relocation entry type.
Statically allocated memory references in code segments 
are created using a pair of SPARC instructions, sethi 
and or. A pair of instructions is required to construct 
a 32-bit immediate because SPARC instructions have a 
fixed 32-bit width. The sethi instruction initializes 
the most significant 22 bits of a word to an immediate 
value, while the or instruction is used to initialize the 
least significant 10 bits (if needed). These instructions 
use the R_SPARC_HI22 and R_SPARC_LO10 relocation
entry types, respectively. 
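The effect of these three relocation types on a word can be sketched as follows. The enum and struct here are hypothetical stand-ins for the ELF definitions, not the real elf.h declarations; the field arithmetic follows the usual symbol-plus-addend rule, with the upper 22 bits placed in the sethi immediate and the lower 10 bits placed in the 13-bit immediate field of the or.

```c
#include <stdint.h>

/* Hypothetical names standing in for R_SPARC_32, R_SPARC_HI22, R_SPARC_LO10. */
enum reloc_type { RELOC_SPARC_32, RELOC_SPARC_HI22, RELOC_SPARC_LO10 };

struct reloc {
    enum reloc_type type;
    uint32_t sym_addr;   /* final address of the referenced symbol */
    uint32_t addend;     /* constant offset, e.g. 4 for &x + 4 */
};

/* Return the relocated contents of a data or instruction word. */
uint32_t apply_reloc(uint32_t word, struct reloc r)
{
    uint32_t target = r.sym_addr + r.addend;

    switch (r.type) {
    case RELOC_SPARC_32:     /* whole 32-bit data word */
        return target;
    case RELOC_SPARC_HI22:   /* 22-bit immediate of sethi */
        return (word & ~0x003FFFFFu) | (target >> 10);
    case RELOC_SPARC_LO10:   /* low 10 bits into the 13-bit immediate */
        return (word & ~0x00001FFFu) | (target & 0x3FFu);
    }
    return word;
}
```

Because every static reference ends up in one of these fixed shapes, a scanner only has to look for 32-bit data words and sethi immediates, which is what makes identification without relocation tables feasible.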


Even without relocation tables, we know that stati- 
cally allocated memory references in the code segment 
are specified using a sethi instruction containing the 
most significant 22 bits of the address, and any statically 
allocated memory references in the data segment must 
be valid 32-bit addresses. However, even this knowledge 
would not be useful if the memory address references 
could be encoded in an arbitrarily complex manner, such 
as referring to an address in statically allocated memory 
shifted right by four or an address that has been logically 
negated. Scanning code and data segments for all pos- 
sible encodings would be extremely difficult and would 
likely lead to many false positives and negatives. For- 
tunately, this situation does not occur in practice, as all 
major object file formats (ELF [41], a.out, PE [26], and 
Mach-O) restrict references to statically allocated mem- 
ory to a single valid symbol in the current executable or 
library plus a constant offset. Figure 2 presents a few C 
code examples demonstrating this restriction. 


Algorithm 1 summarizes our scheme for initializing the
P bit for static root pointer assignments without reloca- 
tion tables. We scan any data segments for 32-bit val- 
ues that are within the virtual address range of the cur- 
rent executable or shared library and set the P bit for any 
matches. To recognize root pointer assignments in code, 
we scan the code segment for sethi instructions. If the
immediate operand of the sethi instruction specifies a 
constant within the virtual address range of the current 
executable or shared library, we set the P bit of the instruction.
Unlike the x86, the SPARC has fixed-length
instructions, allowing for easy disassembly of all code
regions.


int x, y;
int * p = &x + 0x80000000;      // symbol + any 32-bit offset is OK
int * p = &x;                   // symbol + no offset is OK
int * p = (int) &x + (int) &y;  // cannot add two symbols, will not compile
int * p = (int) &x * 4;         // cannot multiply a symbol, will not compile
int * p = (int) &x ^ -1;        // cannot xor a symbol, will not compile

Figure 2: C code showing valid and invalid references to statically allocated memory. Variables x, y, and p are global variables.


Algorithm 1 Pseudocode for identifying static root pointer assignments in SPARC ELF binaries.

procedure CHECKSTATICCODE(ElfObject o, Word * w)
    if *w is a sethi instruction then
        x ← extract_cst22(*w)    ▷ extract 22 bit constant from sethi, set least significant 10 bits to zero
        if x >= o.obj_start and x < o.obj_end then
            set_p_bit(w)
        end if
    end if
end procedure

procedure CHECKSTATICDATA(ElfObject o, Word * w)
    if *w >= o.obj_start and *w < o.obj_end then
        set_p_bit(w)
    end if
end procedure

procedure INITSTATICPOINTER(ElfObject o)
    for all segment s in o do
        for all word w in segment s do
            if s is executable then
                CheckStaticCode(o, w)
            end if
            CheckStaticData(o, w)    ▷ Executable sections may contain read-only data
        end for
    end for
end procedure
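Algorithm 1 translates almost line for line into C. The sketch below is one possible rendering under stated assumptions: the struct layouts are hypothetical, P bits are kept in a parallel bool array instead of hardware tags, and sethi is recognized by its SPARC V8 encoding (op field 00 in bits 31:30, op2 field 100 in bits 24:22).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct segment {
    const uint32_t *words;   /* segment contents, one entry per word */
    bool *p_bits;            /* software stand-in for the per-word P bit */
    size_t nwords;
    bool executable;
};

struct elf_object {
    uint32_t obj_start, obj_end;   /* virtual address range of this object */
    struct segment *segs;
    size_t nsegs;
};

bool is_sethi(uint32_t insn)
{
    return (insn & 0xC1C00000u) == 0x01000000u;
}

/* 22-bit constant from sethi, least significant 10 bits set to zero. */
uint32_t extract_cst22(uint32_t insn)
{
    return (insn & 0x003FFFFFu) << 10;
}

bool in_object(const struct elf_object *o, uint32_t x)
{
    return x >= o->obj_start && x < o->obj_end;
}

void init_static_pointers(struct elf_object *o)
{
    for (size_t s = 0; s < o->nsegs; s++) {
        struct segment *seg = &o->segs[s];

        for (size_t i = 0; i < seg->nwords; i++) {
            uint32_t w = seg->words[i];

            if (seg->executable && is_sethi(w) &&
                in_object(o, extract_cst22(w)))
                seg->p_bits[i] = true;
            /* Executable segments may also hold read-only data. */
            if (in_object(o, w))
                seg->p_bits[i] = true;
        }
    }
}
```

Note that the data check marks any word whose value falls inside the object's address range, so an ordinary integer with such a value is treated as a pointer.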


Modern object file formats do not allow executables 
or libraries to contain direct references to another object 
file’s symbols, so we need to compare possible pointer 
values against only the current object file’s start and end 
addresses, rather than the start and end addresses of all 
executables and libraries in the process address space.
This algorithm is executed once for the executable at 
startup and once for each shared library when it is ini- 
tialized by the dynamic linker. As shown in Section 4.5, 
the runtime overhead of the initialization is negligible. 


In contrast with our scheme, pointer identification in 
the original proposal for a PI-based policy is impracti- 
cal. The scheme in [19] attempts to dynamically detect 
pointers by checking if the operands of any instructions 
used for pointer arithmetic can be valid pointers to the 
memory regions currently used by the program. This requires
scanning the page tables for every add or subtract
instruction, which is prohibitively expensive. 


4.3 Discussion 


False positives and negatives due to P bit initializa- 
tion: Without access to the relocation tables, our 
scheme for root pointer identification could lead to false 
positives or negatives in our security analysis. If an inte- 
ger in the data segment has a value that happens to cor- 
respond to a valid memory address in the current exe- 
cutable or shared library, its P bit will be set even though 
it is not a pointer. This misclassification can cause a 
false negative in our buffer overflow detection. A false 
positive in the buffer overflow protection is also possi- 
ble, although we have not observed one in practice thus 
far. All references to statically allocated memory are
restricted by the object file format to a single symbol plus
a constant offset. Our analysis will fail to identify a
pointer only if this offset is large enough to cause the 
symbol+offset sum to refer to an address outside of the 
current executable object. Such a pointer would be out- 
side the bounds of any valid memory region in the exe- 
cutable and would cause a segmentation fault if derefer- 
enced. 

DIFT tags at word granularity: Unlike prior 
work [19], we use per-word tags (P and T bits) rather 
than per-byte tags. Our policy targets pointer corrup- 
tion, and modern ABIs require pointers to be naturally 
aligned, 32-bit values, even on the x86 [41]. Hence, we 
can reduce the memory overhead of DIFT from eight bits 
per word to two bits per word. 

As explained in Section 3, we must specify how to 
handle partial word writes during byte or halfword stores. 
These writes only update part of a memory word and 
must combine the new tag of the value being written 
to memory with the old tag of the destination memory 
word. The combined value is then used to update the tag 
of the destination memory word. For taint tracking (T 
bit), we OR the new T bit with the old one in memory, 
since we want to track taint as conservatively as possi- 
ble. Writing a tainted byte will taint the entire word of 
memory, and writing an untainted byte to a tainted word 
will not untaint the word. For pointer tracking (P bit), 
we must balance protection and false positive avoidance. 
We want to allow a valid pointer to be copied byte-per- 
byte into a word of memory that previously held an in- 
teger and still retain the P bit. However, if an attacker 
overwrites a single byte of a pointer [18], that pointer 
should lose its P bit. To satisfy these requirements, byte 
and halfword store instructions always set the destina- 
tion memory word’s P bit to that of the new value being 
written, ignoring the old P bit of the destination word. 
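
The two merge rules for partial word writes can be summarized in a short Python sketch. The function name and the use of 0/1 integers for tag bits are assumptions for illustration; in our design the merge is performed by the tag hardware, not by software.

```python
def store_partial_word_tags(old_t, new_t, old_p, new_p):
    """Return the (T, P) tags of a memory word after a byte or halfword store.

    old_t, old_p: tags currently held by the destination word.
    new_t, new_p: tags of the value being written.
    """
    # Taint is tracked conservatively: one tainted byte taints the word,
    # and an untainted byte never untaints it.
    t = old_t | new_t
    # The P bit follows the written value; the old P bit is ignored, so a
    # partial overwrite of a pointer by non-pointer data clears the P bit.
    p = new_p
    return t, p
```

For example, storing a tainted non-pointer byte into a clean pointer word, store_partial_word_tags(0, 1, 1, 0), yields (1, 0): the word becomes tainted and is no longer treated as a pointer.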

Caching P Bit initialization: For performance rea- 
sons, it is unwise to always scan all memory regions of 
the executable and any shared libraries at startup to ini- 
tialize the P bit. P bit initialization results can be cached, 
as the pointer status of an instruction or word of data at 
startup is always the same. The executable or library can 
be scanned once, and a special ELF section containing 
a list of root pointer assignments can be appended to the 
executable or library file. At startup, the security monitor 
could read this ELF section, initializing the P bit for all 
specified addresses without further scanning. 
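
One possible encoding of such a cached list is sketched below. The layout (a 32-bit count followed by sorted 32-bit big-endian addresses) and the function names are hypothetical choices made for this example; neither the ELF specification nor this paper defines the section format.

```python
import struct


def encode_root_list(addresses):
    """Serialize root pointer addresses as a count plus big-endian words."""
    addrs = sorted(addresses)
    return struct.pack(">I", len(addrs)) + struct.pack(">%dI" % len(addrs), *addrs)


def decode_root_list(blob):
    """Recover the set of addresses whose P bit should be set at startup."""
    (count,) = struct.unpack_from(">I", blob, 0)
    return set(struct.unpack_from(">%dI" % count, blob, 4))
```

At startup the monitor would decode the section and set the P bit for each listed address, replacing the full scan with a single pass over the list.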


4.4 Portability to Other Systems 


We believe that our approach is portable to other archi- 
tectures and operating systems. The propagation and 
check rules reflect how pointers are used in practice and 
for the most part are architecture neutral. However, our 
pointer initialization rules must be ported when moving
to a new platform. Identifying dynamic root pointer
assignments is OS-dependent, but requires only modest
effort. All we require is a list of system calls that
dynamically allocate memory. 

Identifying static root pointer assignments depends 
on both the architecture and the object file format. 
We expect our analysis for static pointer initializations 
within data segments to work on all modern platforms. 
This analysis assumes that initialized pointers within the 
data segment are word-sized, naturally aligned variables 
whose value corresponds to a valid memory address 
within the executable. To the best of our knowledge, this 
assumption holds for all modern object file formats, in- 
cluding the dominant formats for x86 systems [26, 41]. 

Static root pointer assignments in code segments can 
be complex to identify for certain architectures. Port- 
ing to other RISC systems should not be difficult, as all 
RISC architectures use fixed-length instructions and pro- 
vide an equivalent to sethi. For instance, MIPS uses 
the load-upper-immediate instruction to set the high 16 
bits of a register to a constant. Hence, we just need to 
adjust Algorithm 1 to target these instructions. 

However, CISC architectures such as the x86 require a 
different approach because they support variable-length 
instructions. Static root pointer assignments are performed
using an instruction such as movl that initializes
a register to a full 32-bit constant. However, CISC object 
files are more difficult to analyze, as precisely disassem- 
bling a code segment with variable-length instructions is 
undecidable. To avoid the need for precise disassembly, 
we can conservatively identify potential instructions that 
contain a reference to statically allocated memory. 

A conservative analysis to perform P bit initialization 
on CISC architectures would first scan the entire code 
segment for valid references to statically allocated mem- 
ory. A valid 32-bit memory reference may begin at any 
byte in the code segment, as a variable-length ISA places 
no alignment restrictions on instructions. For each valid 
memory reference, we scan backwards to determine if 
any of the bytes preceding the address can form a valid 
instruction. This may require scanning a small number 
of bytes up to the maximum length of an ISA instruction. 
Disassembly may also reveal multiple candidate instruc- 
tions for a single valid address. We examine each can- 
didate instruction and conservatively set the P bit if the 
instruction may initialize a register to the valid address. 
This allows us to conservatively identify all static root 
pointer assignments, even without precise disassembly. 
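
A heavily simplified Python sketch of this conservative scan is shown below. It is an illustration under stated assumptions, not an implementation we evaluated: it recognizes only one instruction form, the x86 mov r32, imm32 encoding (opcodes B8 through BF followed by a little-endian 32-bit constant), and therefore looks back a single byte. A full analysis would examine every candidate instruction up to the maximum instruction length. The function name is hypothetical.

```python
import struct

# Opcodes B8+rd encode "mov r32, imm32" on x86.
MOV_R32_IMM32 = range(0xB8, 0xC0)


def candidate_root_pointers(code, obj_start, obj_end):
    """Return byte offsets of 32-bit constants inside [obj_start, obj_end)
    that are directly preceded by a mov r32, imm32 opcode byte."""
    hits = []
    # A reference may begin at any byte, so every offset is tried.
    for off in range(1, len(code) - 3):
        (value,) = struct.unpack_from("<I", code, off)
        if obj_start <= value < obj_end and code[off - 1] in MOV_R32_IMM32:
            hits.append(off)
    return hits
```

Because the check never requires knowing where instructions truly begin, it may report bytes that are not real instructions. That is acceptable here, since the analysis is meant to err on the side of setting the P bit.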


4.5 Evaluation 


To evaluate our security scheme, we implemented our
DIFT policy for buffer overflow prevention on the Raksha
system. We extended a Linux 2.6.21.1 kernel to set
the P bit for pointers returned by memory allocation system
calls and to initialize taint bits. Policy configuration
registers and register tags are saved and restored during
traps and interrupts. We taint the environment variables
and program arguments when a process is created, and
also taint any data read from the filesystem or network.
The only exception is reading executable files owned by
root or a trusted user. The dynamic linker requires
root-owned libraries and executables to be untainted, as it
loads pointers and executes code from these files.


Program         | Vulnerability  | Attack Detected
polymorph [35]  | Stack overflow | Overwrite frame pointer, return address
atphttpd [1]    | Stack overflow | Overwrite frame pointer, return address
sendmail [24]   | BSS overflow   | Overwrite application data pointer
traceroute [42] | Double free    | Overwrite heap metadata pointer
nullhttpd [30]  | Heap overflow  | Overwrite heap metadata pointer

Table 3: The security experiments for BOF detection in userspace.


Program    | PI (normal) | PI (and emulation)
164.gzip   | 1.002x      | 1.320x
175.vpr    | 1.001x      | 1.000x
176.gcc    | 1.000x      | 1.065x
181.mcf    | 1.000x      | 1.010x
186.crafty | 1.000x      | 1.000x
197.parser | 1.000x      | 2.230x
254.gap    | 1.000x      | 2.590x
255.vortex | 1.000x      | 1.130x
256.bzip2  | 1.000x      | 1.050x
300.twolf  | 1.000x      | 1.010x

Table 4: Normalized execution time after the introduction of the PI-based
buffer overflow protection policy. The execution time without
the security policy is 1.0. Execution time higher than 1.0 represents
performance degradation.

Our security monitor initializes the P bit of each li- 
brary or executable in the user’s address space and han- 
dles security exceptions. The monitor was compiled as 
a statically linked executable. The kernel loads the monitor
into the address space of every application, including
init. When a process begins execution, control is
first transferred to the monitor, which performs P bit ini- 
tialization on the application binary. The monitor then 
sets up the policy configuration registers with the buffer 
overflow prevention policy, disables trusted mode, and 
transfers control to the real application entry point. The 
dynamic linker was slightly modified to call back to the 
security monitor each time a new library is loaded, so that 
P bit initialization can be performed. All application and 
library instructions in all userspace programs run with 
buffer overflow protection. 

No userspace applications or libraries, excluding the 
dynamic linker, were modified to support DIFT analysis. 
All binaries in our experiments are stripped, and contain
no debugging information or relocation tables. The security
of our system was evaluated by attempting to exploit
a wide range of buffer overflows on vulnerable, unmod- 
ified applications. The results are presented in Table 3. 
We successfully prevented both control and data pointer 
overwrites on the stack, heap, and BSS. In the case of 
polymorph, we also tried to corrupt a single byte or a 
halfword of the frame pointer instead of the whole word. 
Our policy detected the attack correctly as we do track 
partial pointer overwrites (see Section 4.3). 


To test for false positives, we ran a large number of 
real-world workloads such as compiling applications like 
Apache, booting the Gentoo Linux distribution, and run- 
ning Unix binaries such as perl, GCC, make, sed, awk, 
and ntp. No false positives were encountered, despite our 
conservative tainting policy. 


To evaluate the performance overhead of our policy, 
we ran 10 integer benchmarks from the SPECcpu2000 
suite. Table 4 (column titled “PI (normal)”) shows 
the overall runtime overhead introduced by our security 
scheme, assuming no caching of the P bit initialization. 
The runtime overhead is negligible (<0.1%) and solely 
due to the initialization of the P bit. The propagation and 
check of tag bits is performed in hardware at runtime and 
has no performance overhead [13]. 


We also evaluated the more restrictive P bit propagation
rule for and instructions from [19]. The P bit of
the destination operand is set only if the P bits of the
source operands differ and the non-pointer operand has
its sign bit set. The rationale is that a pointer is
aligned by masking it with a negative value, such as
masking against -4 to force word alignment. If the user
is instead extracting a byte from the pointer, an operation
which does not create a valid pointer, the sign bit
of the mask will be cleared. 
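
The restrictive rule can be stated compactly in Python. This is a sketch for 32-bit operands; the function name and argument order are assumptions for illustration and do not reflect how the exception handler is written.

```python
SIGN_BIT = 0x80000000


def and_result_p_bit(a, a_p, b, b_p):
    """P bit of the result of `and a, b` under the restrictive rule from [19].

    a, b: 32-bit operand values; a_p, b_p: their P bits.
    """
    if a_p == b_p:
        # Both pointers or both non-pointers: the result is not a pointer.
        return False
    mask = b if a_p else a  # the non-pointer operand
    # A negative mask (sign bit set) indicates alignment, not byte extraction.
    return bool(mask & SIGN_BIT)
```

Masking a pointer with 0xFFFFFFFC (the 32-bit encoding of -4) keeps the P bit, while masking it with 0xFF clears it.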


This more conservative rule requires any and instruction
with a pointer argument to raise a security exception,
as the data-dependent tag propagation rule is too
expensive to support in hardware. The security exception
handler performs this propagation in software for and
instructions with valid pointer operands. While we
encountered no false positives with this rule, performance
overheads of up to 160% were observed for some
SPECcpu2000 benchmarks (see rightmost column in Table 4).
We believe this stricter propagation policy for and
instructions provides only a minor improvement in security
and does not justify the increase in runtime overhead. 


5 Extending BOF Protection to Kernelspace 


The OS kernel presents unique challenges for buffer 
overflow prevention. Unlike userspace, the kernel shares 
its address space with many untrusted processes, and 
may be entered and exited via traps. Hardcoded con- 
stant addresses are used to specify the beginning and 
end of kernel memory maps and heaps. The kernel may 
also legitimately dereference untrusted pointers in cer- 
tain cases. Moreover, the security requirements for the 
kernel are higher as compromising the kernel is equiva- 
lent to compromising all applications and user accounts. 

In this section, we extend our userspace buffer over- 
flow protection to the OS kernel. We demonstrate our 
approach by using the PI-based policy to prevent buffer 
overflows in the Linux kernel. In comparison to prior 
work [11], we do not require the operating system to be 
ported to a new architecture, protect the entire OS code- 
base with no real-world false positives or errors, support 
self-modifying code, and have low runtime overhead. 
We also provide the first comprehensive runtime detec- 
tion of user-kernel pointer dereference attacks. 


5.1 Entering and Exiting Kernelspace 


The tag propagation and check rules described in Tables 
1 and 2 for userspace protection are also used with the 
kernel. The kernelspace policy differs only in the P and T 
bit initialization and the rules used for handling security 
exceptions due to tainted pointer dereferences. 

Nevertheless, the system may at some point use dif- 
ferent security policies for user and kernel code. To en- 
sure that the proper policy is applied to all code execut- 
ing within the operating system, we take advantage of 
the fact that the only way to enter the kernel is via a trap, 
and the only way to exit is by executing a return from 
trap instruction. When a trap is received, trusted mode 
is enabled by hardware and the current policy configu- 
ration registers are saved to the kernel stack by the trap 
handler. The policy configuration registers are then re- 
initialized to the kernelspace buffer overflow policy and 
trusted mode is disabled. Any subsequent code, such as 
the actual trap handling code, will now execute with ker- 
nel BOF protection enabled. When returning from the 
trap, the configuration registers for the interrupted user 
process must be restored. 

The only kernel instructions that do not execute with 
buffer overflow protection enabled are the instructions 
that save and restore configuration registers during trap
entry and exit, a few trivial trap handlers written in assembly
which do not access memory at all, and the fast
path of the SPARC register window overflow/underflow 
handler. We do not protect these handlers because they 
do not use a runtime stack and do not access kernel mem- 
ory unsafely. Enabling and disabling protection when 
entering and exiting such handlers could adversely affect 
system performance without improving security. 


5.2 Pointer Identification in the Presence 
of Hardcoded Addresses 


The OS kernel uses the same static root pointer assign- 
ment algorithm as userspace. At boot time, the kernel 
image is scanned for static root pointer assignments by 
scanning its code and data segments, as described in Sec- 
tion 4. However, dynamic root pointer assignments must 
be handled differently. In userspace applications, dy- 
namically allocated memory is obtained via OS system 
calls such as mmap or brk. In the operating system, a 
variety of memory map regions and heaps are used to dy- 
namically allocate memory. The start and end virtual ad- 
dresses for these memory regions are specified by hard- 
coded constants in kernel header files. All dynamically 
allocated objects are derived from the hardcoded start 
and end addresses of these dynamic memory regions. 

In kernelspace, all dynamic root pointer assignments 
are contained in the kernel code and data at startup. 
When loading the kernel at system boot time, we scan 
the kernel image for references to dynamically allocated 
memory maps and heaps. All references to dynamically 
allocated memory must be to addresses within the ker- 
nel heap or memory map regions identified by the hard- 
coded constants. To initialize the P bit for dynamic root 
pointer assignments, any sethi instruction in the code 
segment or word of data in the data segment that spec- 
ifies an address within one of the kernel heap or mem- 
ory map regions will have its P bit set. Propagation will 
then ensure that any values derived from these pointers at 
runtime will also be considered valid pointers. The P bit 
initialization for dynamic root pointer assignments and 
the initialization for static root pointer assignments can 
be combined into a single pass over the code and data 
segments of the OS kernel image at bootup. 

On our Linux SPARC prototype, the only heap or
memory map ranges that should be indexed by untrusted
information are the vmalloc heap and the fixed address,
pkmap, and srmmu-nocache memory map regions.
The start and end values for these memory regions can
be easily determined by reading the header files of the
operating system, such as the vaddrs SPARC-dependent
header file in Linux. All other memory map and heap
regions in the kernel are small private I/O memory map
regions whose pointers should never be indexed by
untrusted information and thus do not need to be identified
during P bit initialization to prevent false positives. 

Kernel heaps and memory map regions have an inclu- 
sive lower bound, but exclusive upper bound. However, 
we encountered situations where the kernel would com- 
pute valid addresses relative to the upper bound. In this 
situation, a register is initialized to the upper bound of 
a memory region. A subsequent instruction subtracts a 
non-zero value from the register, forming a valid address 
within the region. To allow for this behavior, we treat a 
sethi constant as a valid pointer if its value is greater 
than or equal to the lower bound of a memory region and 
less than or equal to the upper bound of a memory re- 
gion, rather than strictly less than the upper bound. This 
issue was never encountered in userspace. 
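
The adjusted bounds test can be written as a short Python sketch. The function name is hypothetical, and the region list passed in the usage note is made up for illustration; real bounds would come from the kernel header files.

```python
def is_kernel_dynamic_root(value, regions):
    """True if value may be a root pointer into a kernel heap or memory map.

    regions is a list of (lower, upper) address pairs. The upper bound is
    treated as inclusive because the kernel sometimes loads the upper bound
    into a register and subtracts from it to form a valid address.
    """
    return any(lower <= value <= upper for lower, upper in regions)
```

With a made-up region (0xFE000000, 0xFF000000), the constant 0xFF000000 is accepted even though it is not itself a valid address within the region, while 0xFF000400 is rejected.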


5.3 Untrusted Pointer Dereferences 


Unlike userspace code, there are situations where the ker- 
nel may legitimately dereference an untrusted pointer. 
Many OS system calls take untrusted pointers from 
userspace as an argument. For example, the second ar- 
gument to the write system call is a pointer to a user 
buffer. 

Only special routines such as copy_to_user() in 
Linux or copyin() in BSD may safely dereference a 
userspace pointer. These routines typically perform a 
simple bounds check to ensure that the user pointer does 
not point into the kernel’s virtual address range. The un- 
trusted pointer can then safely be dereferenced without 
compromising the integrity of the OS kernel. If the ker- 
nel does not perform this access check before derefer- 
encing a user pointer, the resulting security vulnerability 
allows an attacker to read or write arbitrary kernel ad- 
dresses, resulting in a full system compromise. 

We must allow legitimate dereferences of tainted 
pointers in the kernel, while still preventing pointer cor- 
ruption from buffer overflows and detecting unsafe user 
pointer dereferences. Fortunately, the design of modern 
operating systems allows us to distinguish between le- 
gitimate and illegitimate tainted pointer dereferences. In 
the Linux kernel and other modern UNIX systems, the 
only memory accesses that should cause an MMU fault 
are accesses to user memory. For example, an MMU 
fault can occur if the user passed an invalid memory ad- 
dress to the kernel or specified an address whose con- 
tents had been paged to disk. The kernel must distin- 
guish between MMU faults due to load/stores to user 
memory and MMU faults due to bugs in the OS kernel. 
For this purpose, Linux maintains a list of all kernel in- 
structions that can access user memory and recovery rou- 
tines that handle faults for these instructions. This list is 
kept in the special ELF section __ex_table in the Linux 
kernel image. When an MMU fault occurs, the kernel
searches __ex_table for the faulting instruction’s address. 
If a match is found, the appropriate recovery routine is 
called. Otherwise, an operating system bug has occurred 
and the kernel panics. 

We modified our security handler so that on a security 
exception due to a load or store to an untrusted pointer, 
the memory access is allowed if the program counter 
(PC) of the faulting instruction is found in the __ex_table 
section and the load/store address does not point into ker- 
nelspace. Requiring tainted pointers to specify userspace 
addresses prevents user/kernel pointer dereference at- 
tacks. Additionally, any attempt to overwrite a kernel 
pointer using a buffer overflow attack will be detected be- 
cause instructions that access the corrupted pointer will 
not be found in the __ex_table section. 
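
The decision made by the security handler can be sketched in Python. The function and parameter names are assumptions for illustration: ex_table_pcs stands for the set of instruction addresses listed in __ex_table, and kernel_start for the lowest kernelspace virtual address. The sketch also assumes that kernelspace occupies the top of the address space and ignores the access size.

```python
def allow_tainted_access(pc, address, ex_table_pcs, kernel_start):
    """Decide whether a load or store through a tainted pointer may proceed.

    The access is allowed only if the faulting instruction is one that the
    kernel expects to touch user memory and the target lies in userspace.
    """
    is_user_access_insn = pc in ex_table_pcs
    targets_userspace = address < kernel_start
    return is_user_access_insn and targets_userspace
```

A tainted pointer into kernelspace fails the second condition (a user/kernel pointer dereference attack), and a corrupted kernel pointer dereferenced by ordinary kernel code fails the first.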


5.4 Portability to Other Systems 


We believe this approach is portable to other architec- 
tures and operating systems. To perform P bit initializa- 
tion for a new operating system, we would need to know 
the start and end addresses of any memory regions or 
heaps that would be indexed by untrusted information. 
Alternatively, if such information was unavailable, we 
could consider any value within the kernel’s virtual ad- 
dress space to be a possible heap or memory map pointer 
when identifying dynamic root pointer assignments at 
system bootup. 

Our assumption that MMU faults within the kernel oc- 
cur only when accessing user addresses also holds for 
FreeBSD, NetBSD, OpenBSD, and OpenSolaris. Rather 
than maintaining a list of instructions that access user 
memory, these operating systems keep a special MMU 
fault recovery function pointer in the Process Control 
Block (PCB) of the current task. This pointer is only 
non-NULL when executing routines that may access user 
memory, such as copyin(). If we implemented our 
buffer overflow protection for these operating systems, a 
tainted load or store would be allowed only if the MMU 
fault pointer in the PCB of the current process was non- 
NULL and the load or store address did not point into 
kernelspace. 


5.5 Evaluation 


To evaluate our buffer overflow protection scheme with 
OS code, we enabled our PI policy for the Linux ker- 
nel. The SPARC BIOS was extended to initialize the P 
bit for the OS kernel at startup. After P bit initializa- 
tion, the BIOS initializes the policy configuration regis- 
ters, disables trusted mode, and transfers control to the 
entry point of the OS kernel with buffer overflow protection
enabled.


Module Targeted             | Vulnerability       | Attack Detected
quotactl system call [52]   | User/kernel pointer | Tainted pointer to kernelspace
i2o driver [52]             | User/kernel pointer | Tainted pointer to kernelspace
sendmsg system call [2, 47] | Heap overflow       | Overwrite heap metadata pointer
                            | Stack overflow      | Overwrite local data pointer
moxa driver [45]            | BSS overflow        | Overwrite BSS data pointer
cm4040 driver [40]          | Heap overflow       | Overwrite heap metadata pointer

Table 5: The security experiments for BOF detection in kernelspace. 


When running the kernel, we considered any data re- 
ceived from the network or disk to be tainted. Any data 
copied from userspace was also considered tainted, as 
were any system call arguments from a userspace system 
call trap. As specified in Section 4.5, we also save/restore 
policy registers and register tags during traps. The above 
modifications were the only changes made to the ker- 
nel. All other code, even optimized assembly copy rou- 
tines, context switching code, and bootstrapping code at 
startup, were left unchanged and ran with buffer over- 
flow protection enabled. Overall, our extensions added 
1774 lines to the kernel and deleted 94 lines, mostly in 
architecture-dependent assembly files. Our extensions 
include 732 lines of code for the security monitor, written 
in assembly. 

To evaluate the security of our approach, we exploited 
real-world user/kernel pointer dereference and buffer 
overflow vulnerabilities in the Linux kernel. Our re- 
sults are summarized in Table 5. The sendmsg vulner- 
ability allows an attacker to choose between overwriting 
a heap buffer or stack buffer. Our kernel security pol- 
icy was able to prevent all exploit attempts. For device 
driver vulnerabilities, if a device was not present on the 
FPGA-based prototype, we simulated sufficient device 
responses to reach the vulnerable section of code and per- 
form our exploit. 

We evaluated the issue of false positives by run- 
ning the kernel with our security policy enabled un- 
der a number of system call-intensive workloads. We 
compiled large applications from source, booted Gen- 
too Linux, performed logins via OpenSSH, and served 
web pages with Apache. Despite our conservative taint- 
ing policy, we encountered only one issue, which ini- 
tially seemed to be a false positive. However, we have 
established it to be a bug and potential security vulner- 
ability in the current Linux kernel on SPARC32 and 
have notified the Linux kernel developers. This issue 
occurred during the __bzero() routine, which dereferenced
a tainted pointer whose address was not found in
the __ex_table section. As user pointers may be passed
to __bzero(), all memory operations in __bzero()
should be in __ex_table. Nevertheless, a solitary block
of store instructions did not have an entry. A malicious
user could potentially exploit this bug to cause a local
denial-of-service attack, as any MMU faults caused by
these stores would cause a kernel panic. After fixing this
bug by adding the appropriate entry to __ex_table, no further
false positives were encountered in our system. 


Performance overhead is negligible for most work- 
loads. However, applications that are dominated by copy 
operations between userspace and kernelspace may suf- 
fer noticeable slowdown, up to 100% in the worst case 
scenario of a file copy program. This is due to runtime 
processing of tainted user pointer dereferences, which re- 
quire the security exception handler to verify the tainted 
pointer address and find the faulting instruction in the 
__ex_table section. 


We profiled our system and determined that almost 
all of our security exceptions came from a single ker- 
nel function, copy_user(). To eliminate this overhead, 
we manually inserted security checks at the beginning 
of copy_user() to validate any tainted pointers. After 
the input is validated by our checks, we disable data 
pointer checks until the function returns. This change re- 
duced our performance overhead to a negligible amount 
(<0.1%), even for degenerate cases such as copying files. 
Safety is preserved, as the initial checks verify that the ar- 
guments to this function are safe, and manual inspection 
of the code confirmed that copy_user() would never be- 
have unsafely, so long as its arguments were validated. 
Our control pointer protection prevents attackers from 
jumping into the middle of this function. Moreover,
although checks are disabled while copy_user() is executing,
taint propagation is still on. Hence, copy_user()
cannot be used to sanitize untrusted data. 


6 Comprehensive Protection with Hybrid 
DIFT Policies 


The PI-based policy presented in this paper prevents at- 
tackers from corrupting any code or data pointers. How- 
ever, false negatives do exist, and limited forms of mem- 
ory corruption attacks may bypass our protection. This 
should not be surprising, as our policy focuses on a spe- 
cific class of attacks (pointer overwrites) and operates on 
unmodified binaries without source code access. In this
section, we discuss security policies that can be used to
mitigate these weaknesses. 

False negatives can occur if the attacker overwrites 
non-pointer data without overwriting a pointer [6]. This 
is a limited form of attack, as the attacker must use a 
buffer overflow to corrupt non-pointer data without cor- 
rupting any pointers. The application must then use the 
corrupt data in a security-sensitive manner, such as an 
array index or a flag determining if a user is authenti- 
cated. The only form of non-pointer overwrite our PI 
policy detects is code overwrites, as tainted instruction 
execution is forbidden. Non-pointer data overwrites are 
not detected by our PI policy and must be detected by a 
separate, complementary buffer overflow protection policy. 


6.1 Preventing Pointer Offset Overwrites 


The most frequent way that non-pointers are used in a 
security-sensitive manner is when an integer is used as 
an array index. If an attacker can corrupt an array index, 
the next access to the array using the corrupt offset will 
be attacker-controlled. This indirectly allows the attacker 
to control a pointer value. For example, if the attacker 
wants to access a memory address y and can overwrite 
an index into array x, then the attacker should overwrite 
the index with the value y - x. The next access to x using 
the corrupt index will then access y instead. 
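
The arithmetic can be checked with a toy Python model of memory. The addresses 8 and 40 and the byte-array memory are made-up values for this example. The model assumes byte-sized array elements; for wider elements the index would be the byte distance divided by the element size.

```python
# Toy flat memory; X_BASE and Y_ADDR are hypothetical addresses.
memory = bytearray(64)
X_BASE = 8    # start of array x
Y_ADDR = 40   # address y that the attacker wants to reach


def read_x(index):
    """Unchecked access to x[index], as compiled code without bounds checks."""
    return memory[X_BASE + index]


memory[Y_ADDR] = 0x5A             # value stored at y
corrupt_index = Y_ADDR - X_BASE   # the attacker writes y - x into the index
leaked = read_x(corrupt_index)    # the access to x now reads y
```

No pointer is overwritten at any point: the base of x remains a valid, untainted pointer, and only the integer index is attacker-controlled.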

Our PI policy does not prevent this attack because no 
pointer was overwritten. We cannot place restrictions on 
array indices or other types of offsets without bounds information
or bounds check recognition. Without source
code access or application-specific knowledge, it is dif- 
ficult to formulate general rules to protect non-pointers 
without false positives. If source code is available, the 
compiler may be able to automatically identify security- 
critical data, such as array offsets and authentication 
flags, that should never be tainted [4]. 

A recently proposed form of ASLR [20] can be used 
to protect against pointer offset overwrites. This novel 
ASLR technique randomizes the relative offsets between 
variables by permuting the order of variables and func- 
tions within a memory region. This approach would 
probabilistically prevent all data and code pointer offset 
overwrites, as the attacker would be unable to reliably 
determine the offset between any two variables or func- 
tions. However, randomizing relative offsets requires ac- 
cess to full relocation tables and may not be backwards 
compatible with programs that use hardcoded addresses 
or make assumptions about the memory layout. The re- 
mainder of this section discusses additional DIFT poli- 
cies to prevent non-pointer data overwrites without the 
disadvantages of ASLR. 


6.2 Protecting Offsets for Control Pointers 


To the best of our knowledge, only a handful of reported 
vulnerabilities allow control pointer offsets to be over- 
written [25, 50]. This is most likely due to the relative 
infrequency of large arrays of function pointers in real- 
world code. A buffer overflow is far more likely to di- 
rectly corrupt a pointer before overwriting an index into 
an array of function pointers. 

Nevertheless, DIFT platforms can provide control 
pointer offset protection by combining our PI-based pol- 
icy with a restricted form of BR-based protection. If BR- 
based protection is only used to protect control point- 
ers, then the false positive issues described in Section 2.2 
do not occur in practice [10]. To verify this, we imple- 
mented a control pointer-only BR policy and applied the 
policy to userspace and kernelspace. This policy did not 
result in any false positives, and prevented buffer over- 
flow attacks on control pointers. Our policy classified 
all "and" instructions and all comparisons as bounds checks.

The BR policy has false negatives in different situa- 
tions than the PI policy. Hence the two policies are com- 
plementary. If control-pointer-only BR protection and 
PI protection are used concurrently, then a false negative 
would have to occur in both policies for a control pointer 
offset attack to succeed. The attacker would have to find 
a vulnerability that allowed a control pointer offset to be 
corrupted without corrupting a pointer. The application 
would then have to perform a comparison instruction or 
an "and" instruction that was not a real bounds check on
the corrupt offset before using it. We believe this is very 
unlikely to occur in practice. As we have observed no 
false positives in either of these policies, even in ker- 
nelspace, we believe these policies should be run con- 
currently for additional protection. 
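
The control-pointer-only BR policy can be sketched as a
small taint model. This is a simplified illustration,
not the hardware policy itself: the Word type, the
function names, and the assumption that a loaded table
entry inherits the taint of the index used to address it
are ours. Every "and" and every comparison is treated as
a bounds check, and the only check is made at an indirect
call.

```python
from dataclasses import dataclass


class SecurityException(Exception):
    """Raised when a tainted value is used as a control pointer."""


@dataclass(frozen=True)
class Word:
    value: int
    tainted: bool = False   # stands in for the BR tag bit


def bitwise_and(a, b):
    # Treated as a bounds check: the result is always untainted.
    return Word(a.value & b.value, False)


def compare_less(a, b):
    # Treated as a bounds check: returns the outcome and an
    # untainted copy of the first operand.
    return a.value < b.value, Word(a.value, False)


def load_entry(table, index):
    # Assumption: the loaded pointer inherits the index's taint.
    return Word(table[index.value], index.tainted)


def indirect_call(target, handlers):
    # The only check made by a control-pointer-only policy.
    if target.tainted:
        raise SecurityException("tainted control pointer")
    return handlers[target.value]()
```

In this model an untrusted index that reaches
indirect_call() through load_entry() without a preceding
bitwise_and() or compare_less() raises an exception. The
false negative discussed above corresponds to an "and"
or comparison that clears the taint bit without actually
restricting the index to the bounds of the table.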


6.3 Protecting Offsets for Data Pointers 


Unfortunately, the BR policy cannot be applied to data 
pointer offsets due to the severe false positive issues dis- 
cussed in Section 1. However, specific situations may 
allow for DIFT-based protection of non-pointer data. For 
example, Red Zone heap protection prevents heap buffer 
overflows by placing a canary or special DIFT tag at the 
beginning of each heap chunk [37, 39]. This prevents 
heap buffer overflows from overwriting the next chunk 
on the heap and also protects critical heap metadata such 
as heap object sizes. 

Red Zone protection can be implemented by using 
DIFT to tag heap metadata with a sandboxing bit. Ac- 
cess to memory with the sandboxing bit set is forbidden, 
but sandboxing checks are temporarily disabled when 
malloc() is invoked. A modified malloc() is necessary
to maintain the sandboxing bit, setting it for newly
created heap metadata and clearing it when a heap metadata
block is freed. The DIFT Red Zone heap protection
could be run concurrently with PI protection, providing 
enhanced protection for non-pointer data on the heap. 
We implemented a version of Red Zone protection that 
forbids heap metadata from being overwritten, but allows 
out-of-bounds reads. We then applied this policy to both 
glibc malloc() in userspace and the Linux slab allocator 
in kernelspace. No false positives were encountered dur- 
ing any of our stress tests, and we verified that all of our 
heap exploits from our userspace and kernelspace secu- 
rity experiments were detected by the Red Zone policy. 
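
The Red Zone mechanism can be sketched with a toy model
of tagged memory. This is an illustration under stated
assumptions, not our glibc or slab allocator changes:
the class names, the word-granularity memory, and the
bump allocator with a single metadata word per chunk are
all hypothetical.

```python
class SecurityException(Exception):
    """Raised when a store hits a word whose sandboxing bit is set."""


class TaggedMemory:
    def __init__(self, size):
        self.data = [0] * size
        self.sandbox = [False] * size   # one sandboxing bit per word
        self.checks_enabled = True

    def store(self, addr, value):
        if self.checks_enabled and self.sandbox[addr]:
            raise SecurityException("store to sandboxed word %d" % addr)
        self.data[addr] = value

    def load(self, addr):
        # Out-of-bounds reads are allowed, so loads ignore the bit.
        return self.data[addr]


class RedZoneAllocator:
    """Toy bump allocator: one metadata word (chunk size) per chunk."""

    def __init__(self, mem):
        self.mem = mem
        self.next_free = 0

    def malloc(self, nwords):
        header = self.next_free
        if header + 1 + nwords > len(self.mem.data):
            raise MemoryError("toy heap exhausted")
        self.mem.checks_enabled = False      # checks off inside malloc()
        try:
            self.mem.store(header, nwords)   # write heap metadata
            self.mem.sandbox[header] = True  # tag the new metadata
            self.next_free = header + 1 + nwords
        finally:
            self.mem.checks_enabled = True   # checks back on for the caller
        return header + 1

    def free(self, ptr):
        # Toy free(): clears the tag but does not recycle the chunk.
        self.mem.sandbox[ptr - 1] = False
```

With two adjacent chunks, a store that runs one word
past the end of the first chunk lands on the tagged
metadata of the second and raises an exception, while a
load from the same address succeeds.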


6.4 Beyond Pointer Corruption 


Not all memory corruption attacks rely on pointer or 
pointer offset corruption. For example, some classes of 
format string attacks use only untainted pointers and in- 
tegers [12]. While these attacks are rare, we should still 
strive to prevent them. Previous work on the Raksha sys- 
tem provides comprehensive protection against format 
string attacks using a DIFT policy [13]. The policy uses 
the same taint information as our PI buffer overflow protection.
All calls to the printf() family of functions
are interposed on by the security monitor, which veri- 
fies that the format string does not contain tainted format 
string specifiers such as %n. 
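
The check performed by the monitor can be sketched as
follows. This is a simplified illustration rather than
the Raksha monitor code: the function names are
assumptions, taint is modeled as one boolean per
character, and the specifier parser recognizes only a
rough subset of printf() syntax.

```python
class SecurityException(Exception):
    """Raised when a format string contains a tainted specifier."""


# Characters that may appear between '%' and the conversion character
# (flags, width, precision, length modifiers); a rough approximation.
MODIFIERS = set("#0123456789- +'.*hlLqjzt$")


def tainted_specifiers(fmt, taint):
    """Return every specifier in fmt that contains a tainted character."""
    if len(fmt) != len(taint):
        raise ValueError("need one taint bit per format string character")
    hits = []
    i = 0
    while i < len(fmt):
        if fmt[i] != "%":
            i += 1
            continue
        j = i + 1
        while j < len(fmt) and fmt[j] in MODIFIERS:
            j += 1
        end = min(j, len(fmt) - 1)    # conversion char, or end of string
        if any(taint[i:end + 1]):
            hits.append(fmt[i:end + 1])
        i = end + 1
    return hits


def check_printf_format(fmt, taint):
    """Model of the interposed check made before a printf() call."""
    hits = tainted_specifiers(fmt, taint)
    if hits:
        raise SecurityException(f"tainted format specifiers: {hits!r}")
```

Tainted ordinary text in the format string is accepted
in this sketch; only a specifier such as %n that
contains tainted characters is rejected.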

For the most effective memory corruption protection 
for unmodified binaries, DIFT platforms such as Rak- 
sha should concurrently enable PI protection, control- 
pointer-only BR protection, format string protection, and 
Red Zone heap protection. This would prevent pointer 
and control pointer offset corruption and provide com- 
plete protection against format string and heap buffer 
overflow attacks. We can support all these policies con- 
currently using the four tag bits provided by the Raksha 
hardware. The P bit and T bit are used for buffer overflow 
protection and the T bit is also used to track tainted data 
for format string protection. The sandboxing bit, which 
prevents stores or code execution from tagged memory 
locations, is used to protect heap metadata for Red Zone 
bounds checking, to interpose on calls to the printf()
functions, and to protect the security monitor (see Sec- 
tion 3.1). Finally, the fourth tag bit is used for control- 
pointer-only BR protection. 
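
The tag bit assignment can be summarized in a short
sketch. The bit positions, the Tag and POLICY_BITS
names, and the helper functions are assumptions made
for illustration; only the mapping of policies to bits
follows the description above.

```python
from enum import IntFlag


class Tag(IntFlag):
    P = 0b0001        # pointer bit (buffer overflow protection)
    T = 0b0010        # taint bit (buffer overflow and format string)
    SANDBOX = 0b0100  # sandboxing bit
    BR = 0b1000       # control-pointer-only BR protection


# Which tag bits each concurrently enabled policy relies on.
POLICY_BITS = {
    "PI buffer overflow protection": Tag.P | Tag.T,
    "format string protection": Tag.T | Tag.SANDBOX,
    "Red Zone heap protection": Tag.SANDBOX,
    "security monitor protection": Tag.SANDBOX,
    "control-pointer-only BR protection": Tag.BR,
}


def bits_in_use():
    """Union of the tag bits needed by all policies."""
    used = Tag(0)
    for bits in POLICY_BITS.values():
        used |= bits
    return used


def store_or_execute_allowed(tag):
    # The sandboxing bit forbids stores to, or execution from, a location.
    return not (tag & Tag.SANDBOX)
```

Under this assignment the five uses fit in four bits
because the T bit and the sandboxing bit are each shared
by more than one policy.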


7 Conclusions 


We presented a robust technique for buffer overflow pro- 
tection using DIFT to prevent pointer overwrites. In con- 
trast to previous work, our security policy works with un- 
modified binaries without false positives, prevents both 
data and code pointer corruption, and allows for practical
hardware support. Moreover, this is the first security
policy that provides robust buffer overflow prevention
for the kernel and dynamically detects user/kernel
pointer dereferences. 

To demonstrate our proposed technique, we imple- 
mented a full-system prototype that includes hardware 
support for DIFT and a software monitor that manages 
the security policy. The resulting system is a full Gentoo 
Linux workstation. We show that our prototype prevents 
buffer overflow attacks on applications and the operating 
system kernel without false positives and has an insignif- 
icant effect on performance. The full-system prototyp- 
ing approach was critical in identifying and addressing 
a number of practical issues that arise in large user pro- 
grams and in the kernel code. 

There are several opportunities for future research. We 
plan to experiment further with the concurrent use of 
complementary policies that prevent overwrites of non- 
pointer data (see Section 6). Another promising direc- 
tion is applying taint rules to system call arguments. Per- 
application system call rules could be learned automati- 
cally, restricting tainted arguments to security-sensitive 
system calls such as opening files and executing pro- 
grams. 